August 1997
VITA
PROFESSIONAL SOCIETIES
PUBLICATIONS
Young, G. E. (1995). The Impact of Trial Length and Mode Experience on Failure-
Detection Performance in Monitored and Controlled Dynamic Tasks. Proceedings of
the Eighth International Symposium on Aviation Psychology, 1031-1036.
Berringer, D. B., Allen, R. C., Kozak, K. A., & Young, G. E. (1993). Responses of Pilots
and Non-pilots to Color-coded Altitude Information in a Cockpit Display of Traffic
Information. Proceedings of the Human Factors and Ergonomics Society 37th Annual
Meeting, 84-87.
FIELD OF STUDY
ABSTRACT
Previous research has shown that active controllers can detect failures in a simple dynamic
system faster and more accurately than passive monitors. Further, when controllers transfer
to a monitoring task, they also have better failure detection performance than subjects who
only monitor. This dissertation has two objectives: (a) to replicate previous tracking-task
based findings using a new, cognitively complex dynamic task with failure types which tap
into different cognitive processes, and (b) to use this new task paradigm in an ecologically
valid experimental design to further explore the demonstrated advantages of controlling.
Further, this dissertation advances the contention that the controller/monitor issue should
be conceptualized as a difference in the level of activation of the operator’s mental model of
the system. Results from Experiment 1 fail to replicate past findings of a controller
advantage, but yield the surprising result that past controllers may scan the display more
effectively. Experiment 2 improves upon the basic design of Experiment 1 and makes it
possible to explore the issue of controller versus monitor scan differences in greater depth.
Experiment 2 successfully replicates the controller advantage observed in tracking-task
experiments and supports the conclusion of Experiment 1 that controllers scan the display
more effectively and use the information gained to their advantage. Experiment 3 uses the
same experimental paradigm, but in a design more representative of operational settings.
All subjects in Experiment 3 learned in a controlling mode and then transferred to the
monitoring task. However, subjects were periodically reintroduced to the controlling mode
and its effects on their subsequent monitoring performance were measured. Results
demonstrate that controller reintroduction has a positive effect on monitoring performance.
Implications of these findings for operational environments are discussed in detail.
TABLE OF CONTENTS
INTRODUCTION
    Advantages of Automation
    Peripheralisation
    Situational Awareness
    Mental Models
    Relevant Research
    Present Research
EXPERIMENT 1
    Method
        Subjects
        Apparatus
        Task
        Experimental Design
        Training
    Results
        Signaled Failures
        Inferential Failures
    Discussion
    Experiment 1 Conclusions
EXPERIMENT 2
    Method
        Subjects
        Task
        Experimental Design
        Training
    Results
        Signaled Failures, Session 1
        Signaled Failures, Session 2
        Inferential Failures, Session 1
        Inferential Failures, Session 2
    Discussion
EXPERIMENT 3
    Method
        Subjects
        Task
        Experimental Design Considerations
        Experiment 3 Design
        Training
    Results
        Signaled Failures
        Inferential Failures
        Day 6 Controllers - Separate Analysis
    Discussion
        Signaled Failures
    Conclusion
I know I’m not in the loop, but I’m not exactly out of the loop. It’s more
like I’m flying alongside the loop.
-Anonymous Boeing 767 Captain (Wiener, 1988)
INTRODUCTION
The interaction of pilots and highly automated aircraft has become an increasingly studied
topic in recent years. This interest has been fueled not by the pursuit of academic
enlightenment, but rather by a series of fatal aircraft accidents in which the pilot/auto-pilot
interaction was the primary cause (Billings, 1991). I believe that this precarious order of
events is the result of industry, consumers, and safety advocates alike embracing a
promising technology. While some skeptics warned that the pilot/auto-pilot relationship
might not perform at the consonant level anticipated, such views were largely overshadowed
by enthusiasm for the new technology. In fact, despite the concerns over cockpit
automation, most would agree that an aircraft with a high level of automation is more
efficient, and possibly even safer, than a similar aircraft without it (Billings, 1991).
Although accidents per passenger mile continue on a downward trend, automation-related
problems appear with growing frequency in the probable cause section of accident reports.
The problem, however, is not that automation exists in modern aircraft, but rather that a
lack of foresight has yielded systems that do not always mesh well with their human
operators.
Automation, when used in the aviation domain, is a broad term. It does not refer to a single
device, but rather a class of devices which control the various dynamic processes in an
aircraft ranging from basic mechanical systems to the actual task of “flying” the aircraft.
For purposes of clarity, when "automation" is used in this paper, it refers to the definition
used by Billings (1991): "A system in which many of the processes of production are [...]."
As will be discussed in greater detail, the level of automation in aircraft has been creeping
across the automation continuum for the last 80 years, and has only in the last several
decades become so prominent in the cockpit that it has raised serious concerns.
The question of whether or not to automate civilian transport and military aircraft of all
types is now merely academic (Billings, 1991). Prior to the 1950s the question was "what
can we automate?" Advances in technology soon changed the question to "what should we
automate first?" By the early seventies, such questions were rarely asked as virtually every
component of the cockpit had become, or was on its way to becoming, highly automated. It
has only been since the late 1980s that the question of "what and how much" to automate
has again become an important and serious question. Even proponents acknowledge that
automation has not delivered all of the gains in safety initially anticipated (Van Cott,
Wiener, Wickens, Blackman, & Sheridan, 1996), while skeptics of automation continue to
believe that multi-modal automation may simply be too complex for human operators to
manage safely.
While the future of automation seems to be progressing toward the integration of increased
levels of intelligent automation, the evolution of cockpit automation to this point will be
discussed in detail in the following section. Following that discussion, the various
experimental paradigms and research perspectives which have focused both directly and
indirectly on the issue of cockpit automation will be discussed in detail. These include
vigilance research, peripheralisation, motor-skill factors, the small error/large error trade-
off, automation reliability, flight management system mode complexity, workload levels,
automation induced complacency, situational awareness, and the role of mental models as a
framework for conceptualizing problems with automation. Finally, specific past research
relevant to the present research perspective will be discussed, along with a discussion of
the objectives of the present research.
One need only briefly review the history of both commercial and military aviation to
appreciate the desire of many to reduce the level of human operator control in the cockpit.
Although difficult to quantify, best estimates put the direct contribution of human error to
commercial aviation disasters at approximately 70% (Nagel, 1988), while the role of some
human error as a contributing factor in the chain of events leading to an accident is likely
even higher. This of course does not include fatal mishaps on railroads, ships, automobiles,
and industrial applications, but the percentages are likely similar (Van Cott, Wiener,
Wickens, Blackman, & Sheridan, 1996). Although human error is a complex concept, it
can generally be broken down into the following categories (Woodson, 1981):
1) Perceptual errors: failures in searching for and receiving information, or in correctly
identifying objects.
2) Mediational errors: failures in processing information, or in problem solving and
decision making.
3) Communication errors: failures in exchanging information clearly and correctly.
4) Motor errors: failures to execute simple and complex, discrete and continuous, motor
behaviors correctly.
Automation has, interestingly, in its own duplicitous manner both addressed and
aggravated human error in each of these categories. The purpose of the following
discussion is to explore in detail both the pros and cons of cockpit automation, and the
complex way that it interacts with human error. In fact, this discussion will highlight the
observation that technology has the potential to solve problems while at the same time
creating new ones.
Advantages of Automation
Ample studies of human error in the cockpit have come to the conclusion that a primary,
yet not exclusive cause of human error is excessive workload (Kantowitz & Sorkin, 1983).
Prior to the use of automation in the cockpit, pilots were forced to attend to and manage the
many complex systems in the aircraft (e.g., fuel distribution, engine management, cabin
pressurization, etc.) and fly the aircraft (e.g., manual control, navigation, and
communication). Whereas the first generation of automation provided basic
aircraft control and simplified radio navigation, the second generation of aircraft
automation included the consolidation of displays into integrated displays, the transition
from raw data into more usable command information (e.g., a flight director), and the use
of air data computers to integrate multiple sources of information regarding air density and
direction into usable information both for the pilot and auto-pilot (Billings, 1991). Today,
the third generation of automation sees all complex systems in the aircraft partially or
fully automated, with a Flight Management System integrating and coordinating
all the automated devices on the aircraft (Billings, 1991). If they choose, pilots need only
be involved in commanding the automation through the Flight Management Computer,
and in monitoring the systems in case a failure
occurs. It is for this reason that the majority of aircraft produced today are designed for two
pilots, as compared to four (two pilots, plus a flight engineer and navigator) which was the
case only thirty years ago. In fact, few would argue with the success of the application of
automation to basic systems management; because those tasks were
straightforward, the automation was relatively simple and its execution relatively error free.
Although such automation eventually eliminated the need for a flight engineer altogether, it
did relatively little to ease the workload of the flying pilots since much of the system’s
automation replaced the flight engineer’s duties, but not necessarily the pilots’. Although
flying an aircraft under cruise condition requires relatively low levels of workload, getting
the aircraft from the ground to cruise, and then from cruise to the ground requires
considerable effort on the part of the pilots (Billings, 1991). In addition, a statistical
breakdown of aircraft accidents demonstrates convincingly that these portions of the flight
contain the greatest risk. In fact, 90% of all accidents occur in the climb to, or descent from
cruising condition (Nagel, 1988). This statistic is made more profound by the fact that
these two phases of flight account for less than 40% of the flight time (Nagel, 1988). In
fact, accidents during cruise account for less than 9% of all aircraft accidents, yet cruise
flight accounts for 60% of flight time. Not only must pilots communicate with air traffic
control (ATC) and navigate to the correct location, but they must maintain control of the
aircraft in the desired attitude, altitude, vector, and velocity. Although this task is not
especially demanding in cruise, it becomes considerably more so in the climb and descent
phases of flight due to the presence of hostile weather, aircraft traffic, and frequent ATC
instructions.
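To make the exposure comparison explicit, a rough accident density per unit of flight time
can be computed from the Nagel (1988) figures above (a back-of-the-envelope calculation,
not one reported in the source):

\[
\frac{0.90 / 0.40}{0.09 / 0.60} \;\approx\; \frac{2.25}{0.15} \;=\; 15
\]

That is, the climb and descent phases carry roughly fifteen times the accident risk per hour
of exposure that cruise flight does.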
Given the complexity of the flying task in these phases of flight, the increased risk of
mishap, and the strong correlation between pilot workload and pilot error (Kantowitz &
Sorkin, 1983), tremendous effort was put into the design of Flight Management Systems
(FMS) which, through computerized command of navigation and aircraft control, ease the
burden of controlling the aircraft in these critical phases of flight. Such automation, in
theory, eliminates many time- and resource-consuming tasks which contribute to pilot
workload and error. The pilot's attention could then be directed to monitoring mission
progress and overall system status, rather
than burdening a pilot's cognitive resources with command and control processing.
Further, should a partial or complete failure occur with the automation, the pilot could
quickly and effectively diagnose and re-engage at the point where the auto-pilot
relinquished authority.
If one were to tour the cockpit of a modern airliner, one would find a Flight Management
System (FMS) which not only has the capacity to successfully control and navigate the
aircraft through descent and ascent, but can fly the aircraft from takeoff to taxi at the
destination without a single pilot intervention on the aircraft controls (Billings, 1991). In
fact, until recently, it had been the operational policy of air carriers to encourage their pilots
to use their FMS to its fullest capacity, leaving the pilots with the duty of high-level
supervision and decision making.
The other compelling reason for the development of the FMS, besides the belief that the
human's limited capacity for workload was the primary barrier to aircraft safety, was the
acknowledgment that human inner loop control precision was very limited (Billings, 1991).
Not only is the task of precise control tedious and perceptually demanding, but the high
control error levels of human operators mean considerable loss in efficiency. In fact,
microprocessor control of flight allows all flight phases and transitions to be accomplished
at maximal efficiency. Not only do human pilots lack the specific knowledge of how to
execute control maneuvers with perfect efficiency but, even with this knowledge, their
control accuracy is inadequate. The use of Flight Management Systems has thus introduced
marked improvements in fuel and engine efficiency. In fact, Covey et al. (1979) suggested
that a 12% savings in fuel was possible (through, among other measures,
changes to aircraft systems), with much of this gain coming through the use of automation.
Another study cited by Wiener and Curry (1980) suggested that a three percent reduction in
fuel consumption could result in a 26% increase in airline profits. Fueling this drive
toward efficiency was also the fact that the price of a gallon of jet fuel went from 38 cents in
1978 to 70 cents in 1979 (Wiener & Curry, 1980), and went above a dollar in the 1980s
where it remains today. Improved efficiency clearly increases the profitability of airlines,
increases the need for new aircraft, lowers ticket prices, and reduces environmental impact.
Further, by lowering the cost of flying to the general public, overall transportation safety is
theoretically enhanced by moving people into air travel and away from more dangerous
modes of transportation.
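The leverage of a small fuel saving on profit follows from the thin margins of airline
operation. As an illustration only (the cost shares here are assumed, not taken from Wiener
and Curry, 1980): if fuel represents roughly a quarter of total operating cost C, and profit P
is roughly 3% of C, then a 3% reduction in fuel burn yields

\[
\frac{\Delta P}{P} \;=\; \frac{0.03\, f\, C}{P} \;\approx\; \frac{0.03 \times 0.25}{0.03} \;=\; 0.25,
\]

on the order of the 26% profit increase cited above, where f is fuel's share of operating cost.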
The final reason for the push towards automation was the need to address specific human
error induced safety concerns such as controlled flight into terrain and air-to-air collisions.
In accidents such as these, it was often clear that sufficient information was present so that
given prompt and accurate interpretation of the information, such disasters could be
avoided (Billings, 1991). Computerized systems were thus developed to deal effectively
with these specific classes of error.
A good example of this automation is the Ground Proximity Warning System mandated by
Congress in 1975 to address a series of “controlled flight into terrain” incidents (Wiener &
Curry, 1981). This simple form of automation combines radar and barometric altimetry to
calculate height above ground and rate of change, therefore predicting when a possible
unintended conflict with the ground might occur (Billings, 1991). Such automation is
advisory only, thus leaving ultimate command authority to the pilot. Other examples of
“problem specific” automation include devices which force the control column of an aircraft
forward (known as a “stick pusher”) to avert an aerodynamic “stall,” and the Traffic Alert
and Collision Avoidance System (TCAS), which receives transponder signals from other
aircraft and displays them in relation to one’s own aircraft thus warning of potential
conflict. There are many other such systems in modern aircraft and most would agree that
current “problem specific” automation has been quite successful, despite the common
appearance of problems when these systems are first instituted (Billings, 1991).
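The underlying logic of such a ground-proximity alert can be shown with a brief sketch. The
code below is a simplified illustration of the principle described above (projecting time to
ground impact from height and sink rate); the function name, threshold, and structure are
this author's assumptions for illustration, not the actual GPWS implementation.

    def ground_proximity_advisory(radio_height_ft, sink_rate_fpm, warning_time_s=30.0):
        """Illustrative only: warn when projected time to ground impact is short.

        radio_height_ft: height above ground, from radar altimetry
        sink_rate_fpm:   rate of descent, from barometric altimetry (positive = descending)
        """
        if sink_rate_fpm <= 0:
            return None  # level or climbing: no predicted ground conflict
        time_to_ground_s = radio_height_ft / (sink_rate_fpm / 60.0)
        if time_to_ground_s < warning_time_s:
            return "PULL UP"  # advisory only; command authority stays with the pilot
        return None

    # Example: 1,000 ft above ground while descending at 3,300 ft/min gives
    # roughly 18 seconds to impact and would trigger the advisory.

The design point is that the system only advises; as noted above, ultimate command
authority remains with the pilot.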
Although the previous section highlighted the positive aspects of increased automation, this
evolution has fostered considerable controversy and numerous disasters. Some highly
visible incidents have illuminated the fact that the transition to automation has not been
problem free. I believe that careful analysis of the problems with aircraft automation shows
that many were foreseeable from existing knowledge of human-machine systems. These
problems, combined with other unpredicted problems discovered through the analysis of
accidents and incidents and data from the Aviation Safety Reporting System (ASRS), raise
the following questions:
1. Given ample evidence of the poor monitoring ability of humans, can pilots be trusted to
monitor highly reliable automation effectively?
3. Does automation cause a degradation of motor skills which will impair pilots when, by
choice or necessity, they must resume manual control?
4. Does automation eliminate frequent small human errors but give way to infrequent
serious errors?
6. Can human pilots effectively program and monitor the "multi-modal" Flight
Management Systems which have many programmable flight modes, and can switch modes
without pilot intervention?
7. Does automation really reduce pilot workload, or has the workload remained the same
while shifting from manual to cognitive demands?
8. Does the reliability of automated systems cause complacency in the cockpit which has an
adverse effect on monitoring performance?
9. Does continuous monitoring by pilots cause a loss of "situation awareness" which could
impair performance when manual intervention becomes necessary?
10. Do pilots have “mental models” of the flying task which may be adversely affected by
“flying the automation” rather than flying the aircraft, and thus preventing effective system
monitoring?
These potential problems with automation are clearly not independent, and any
theorized or observed problem with the pilot-automation interface is likely causally related
to several of these factors. The following section discusses each of these factors in greater
detail.
Research on the ability of individuals to maintain effective sustained attention to real time
processes was first studied in earnest in World War II (Parasuraman, 1986), although some
concern can be traced to early questions about inspectors’ abilities to detect assembly line
defects (Wiener, 1984). The advent of radar produced the need for human operators to
monitor this new technology and efficiently detect enemy threats in a highly monotonous
task with few signals. Through a series of field experiments on both sides of the Atlantic, it
quickly became clear that the fragile nature of human monitoring performance meant that
normal working schedules were inappropriate for sustained attention tasks (Wiener, 1984).
In fact, early field studies by the RAF Coastal Command suggested that radar monitoring
periods be kept short, while recommendations from the United States suggested that
monitoring periods not exceed 40 minutes (Wiener, 1984). Research led by Mackworth
(1950) verified in the laboratory what the military had observed in the field and coined the
term "vigilance decrement" to describe
this phenomenon.
The vigilance decrement referred to the fact that after a given period of sustained attention,
human operators lost their ability to effectively discriminate signals which they could
otherwise detect (Parasuraman, 1986). Although the causes of the vigilance decrement are
complex and still debated, its existence in simple sustained attention tasks is unquestioned.
Fortunately, remedies for this problem were unusually simple. Once the onset of the
vigilance decrement was established, a work shift routine was designed so that no operator
was monitoring past the period of full vigilance, and each operator was given enough of a
rest period to recover before returning to the task.
The success in early vigilance research was due in large part to the fact that the actual task
of observing radar displays lent itself well to the design of experimental tasks for laboratory
research (Parasuraman, 1986), although a minority criticized the early research for its
artificially high signal rates (Wiener, 1984). This convenient and uncommon
circumstance, combined with parallel findings in field research, meant that the early
vigilance research was widely accepted and did not suffer the typical validity issues
associated with laboratory research. This was not the case, however, when researchers tried
to apply vigilance research findings to other, usually more complex, sustained attention
tasks. Not only was the majority of vigilance research conducted in the laboratory focused
on extremely simple, low arousal tasks, but when more complex paradigms were used, the
results were inconsistent.
Early vigilance research in which more complex paradigms were used sometimes found no
decrement at all, suggesting that the vigilance decrement was a
laboratory phenomenon not applicable to complex real world tasks (Parasuraman, 1986).
One predominant view of complex task vigilance was based on research conducted by
Adams et al. (1961). This research used a simulated air defense task in which the number
of non-signal targets was either 6 or 36. Although overall detection performance was worse
when the non-signal targets were more abundant, performance did not change with
increases in time spent at the task. This finding led Adams et al. (1962) and others (as
cited in Parasuraman, 1986) to believe that complex tasks yielded sufficient arousal to
prevent a vigilance decrement. Still other studies, however, demonstrated a strong
vigilance decrement. These studies include a three-clock version of the Mackworth clock
task.
Several theories have been offered for the disparity of results in complex task vigilance
research. Most importantly, the tasks and procedures vary widely and thus make it difficult
to compare results across studies.
Adams et al. (1961) suggested that because of large individual differences in complex task
performance, slight vigilance decrements may exist but fail to reach statistical significance.
In one such study, variability in detection rate for a dual-source visual discrimination task
was nearly twice
that for a single source task. Another explanation contends that since complex task
performance is already poor, there is little opportunity for it to get worse with time (Davies
& Tune, 1969). Additionally, it has been proposed that when a complex-task vigilance
decrement exists, it may be only a slight decrement in sensitivity, thus having only slight
practical consequences.
Only a brief review of complex task vigilance research is necessary to appreciate the
historical difficulty in finding a consistent and robust vigilance decrement in complex tasks.
Although some may see this as an implication of a weak if not irrelevant phenomenon,
others have used these diverse findings to support the contention that a vigilance decrement
can occur in complex tasks under certain conditions (Parasuraman, 1986). Regardless,
however, there is little doubt that the diversity of complex tasks studied in the laboratory
has not been valid enough to generalize findings to the operational environment. Thus, it
has been proposed that any theoretical findings from laboratory settings must be tested in
parallel, highly realistic paradigms to ensure ecological validity (Satchell, 1993). While an
undertaking such as this may be unrealistic, it would likely quell some of the debate over
the generality of complex-task vigilance findings.
Peripheralisation
The term peripheralisation has been used to describe the process of role change that pilots
experience as they become increasingly distanced from the essential flight process as levels
of automation increase (Billings, 1991; Norman, Billings, Nagel, Palmer, Wiener, &
Woods, 1988). The peripheralisation process stems partially from the failure of aircraft
designers to focus on human needs in an "out of the loop" control environment (Wiener &
Curry, 1980), but it also reflects the nature of the monitoring role itself.
Satchell (1993) has organized the effects of peripheralisation into the following three
categories, some of which will be discussed in greater detail elsewhere in this paper:
1. Attention and monitoring: Operators' monitoring behavior depends in part on
the consistency and reliability of automation, both of which have been shown to affect
operator trust in the system.
-Task Inversion: A backup or alerting system, for example an altitude alerting system,
becomes the primary information source for
the operators. Such task inversions usually result in altered operator monitoring behavior.
-Automation Deficit: The temporary and relative reduction in manual performance upon
resuming manual control after extended automated operation, leaving the operator unable
to sufficiently deal with a suddenly increased workload. An example of this is the high
workload levels often encountered below 10,000 ft. by cockpit crews following extended
periods of automated cruise flight.
2. Communication: Research into aircraft accidents has generated many examples that
underscore the importance of effective crew communication.
Flight crews who communicate effectively have been shown to communicate more
frequently, openly, directly, and concisely compared to ineffective crews. However, studies
comparing crews in aircraft with different automation levels show that as level of
automation increases, crew communication decreases.
3. Situational awareness: "the accurate perception of the factors and conditions that affect
an aircraft and its flight crew during a defined period of time" (Satchell, 1993).
-The Big Picture: Although related to situational awareness, the big picture refers to
awareness of the state of the system at a global level. An example of this would be the
China Airlines crew who let their 747 stall and enter a spin while they attended to an
engine problem.
-Raw Data Translation: System automation often translates, interprets, and integrates raw
data prior to presenting it on the pilot/system interface. Although this translation of raw
data can lower mental workload by reducing the amount of raw data received by the pilot,
it also distances the crew from the underlying state of the aircraft. If, for example, a
navigation system peripheralizes the crew to the degree that they no longer attend to the
navigation of the aircraft, a devastating result may occur should the automation fail or be
misdirected by the pilots.
The loss of motor skill as a result of lack of practice is a major concern accompanying
increased automation in the cockpit (Endsley, 1995). Not only is the problem salient and
fairly well studied, but it has been a frequently reported concern of pilots of automated
aircraft (Hughes, 1989; Wiener & Curry, 1980). Further, Moray (1986) has emphasized the
need for operators of automatic systems to have extensive manual practice even though it
will seldom be used in actual operation. Interestingly, however, recent accidents involving
automation issues have not shown manual skill proficiency to be a primary concern, since
accidents have generally occurred when the auto-pilot was flying, or the automation and
pilot were “fighting” for control of the aircraft, thus interfering with each other’s relative
control commands (Aviation Week, 1996). This is not to say that degradation of skill is not
a problem, but rather that other automation factors seem to be more causally related to
aircraft mishaps.
It is ironic, however, that proponents of automation have long argued that much of the
value of automation resides in the fact that pilots, when required, can easily intervene and
pick up where the auto-pilot left off. In reality there is considerable evidence that persistent
monitoring erodes both manual skill and knowledge of the relationships between system
variables (Kessel & Wickens, 1982; Shiff, 1983; Wiener & Curry, 1980). It is common for
co-pilots, when transferring from highly automated wide-
body aircraft to narrow body, less automated aircraft, to need a transition period to revive
their proficiency in manual control skills (Wiener & Curry, 1980). Further complicating
this issue is the fact that with the introduction of highly sophisticated FMSs, complementary
changes in airline procedures discourage manual flight (Billings, 1991). Rather than being
evaluated on their manual flying skills, pilots are judged by their effective use of the
vastly capable and complex “integrated flight path and aircraft management systems” in the
cockpit (Billings, 1991). It should also be noted that several recent incidents, for example
the crash of a USAir 737 in Pittsburgh, although as yet unsolved, may have been related to
a wake turbulence induced unusual attitude which became catastrophic when attitude
recovery was not promptly accomplished.
The introduction of technology into society has created the interesting phenomenon of
reducing small errors of precision, at the cost of occasionally introducing very serious large
errors. Consider the frequently cited example of the digital alarm clock. The introduction
of this device meant that the accepted 10 to 15 minute precision error of the analog alarm
clock was now eliminated (Wickens, 1992). However, this technology meant that the
occasional “set up error” (Wiener & Curry, 1980) could yield an error, although infrequent,
of 12 hours, nearly 48 times the magnitude of the analog alarm clock. The same potential
for occasional catastrophic errors exists in the automated cockpit (Wiener, 1988). Consider
the following example:
An Airbus A320 flew into the ground while on a non-precision approach to
Strasbourg, France. Post-crash analysis determined that the aircraft was
descending at a rate of 3,300 ft/min during its pre-crash descent, far steeper
than the 700 ft/min required by the approach (Aviation Week, 1992). However,
the VOR/DME approach chart for that airport required a 3.3 degree angle of
descent, which is what the pilots most likely intended to input into the Flight
Management System. Rather than entering "3.3" with the FMS in the
"Track/Flight Path Angle" descent mode, the mode in use interpreted the
pilots' entry as a vertical speed, causing the aircraft to descend at 3,300
ft/min rather than the intended 3.3 degree angle of descent. The A320 crashed
short of the airfield in mountainous terrain; recorded cockpit conversations
indicate that the pilots never realized an error had been made.
This example, besides being an example of a "mode error," which will be discussed in detail
below, illustrates the small error/large error trade-off. Although it would be difficult for
human pilots to fly a perfect 3.3 degree angle of descent, it is unlikely that in attempting to
do so, they would err by such a large magnitude. The automated system, however, can fly
the aircraft at a 3.3 degree angle of descent nearly free of error, but must be given the
correct command to do so.
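The magnitude of the discrepancy is easy to verify. At a typical approach ground speed (the
120 knot figure here is an assumption for illustration, not a value from the accident report),
a 3.3 degree path corresponds to a descent rate of roughly

\[
\dot{h} \;\approx\; V_g \tan(3.3^\circ) \;\approx\; \left(120 \times 101.3\ \tfrac{\mathrm{ft/min}}{\mathrm{kt}}\right) \times 0.0577 \;\approx\; 700\ \mathrm{ft/min},
\]

which matches the rate the approach required; the commanded 3,300 ft/min was nearly five
times too steep.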
There are abundant examples of such errors being made. However, most are detected
before the error is elevated to a catastrophic level (Billings, 1991; Wiener, 1988). Wiener
(1988) has suggested four approaches to dealing specifically with this problem. First, he
suggests that systems need to be less cordial to erroneous input at the human interface level,
rather than depending on training and correct operation to alleviate the symptoms of bad
design. Second, he suggests that systems be designed to be less vulnerable to unsafe actions
even in the event of erroneous input (for example Ground Proximity Warning Systems).
Third, systems must be designed with error checking, or a certain “intelligence” capability
to deal with the logic of inputs given other relevant factors (for example, comparing pilot’s
altitude inputs with an internal terrain map). Finally, Wiener (1988) suggests that the
entire system, including Air Traffic Control, be designed to be less tolerant to overall
system error (e.g., ensuring that aircraft follow the exact instructions provided by ATC).
None of the suggestions raised by Wiener (1988) are necessarily easy to implement, but
given the catastrophic outcomes of the “large error” problem in the modern cockpit, many
changes are being made in the direction of these ideals. Although GPWS and TCAS are
now mandatory forms of error checking (Van Cott et al., 1996), improved FMS user
interfaces and terrain checking are in the process of being perfected. Further, this issue
remains an active concern in the human factors field (Van Cott et al., 1996). Although the
assumption is that smart automation would
detect and resolve large errors, ample evidence suggests that this is a complex and difficult
undertaking.
Early auto-pilots were simple devices which could turn to and hold a heading, climb to and
hold an altitude, or track a navigation signal for the purpose of decreasing the need for
continuous hands-on control of the aircraft (Billings, 1991). More important was the fact
that every behavior of the auto-pilot had to be specifically commanded by the pilot;
commands to the auto-pilot were never more than one flight transition away from the
current flight condition (e.g., the pilot could command the auto-pilot to turn to a specific
heading and hold that heading, but could not at that time input a future heading change
command). As automation transitioned into its second generation in the late 1950s
(Billings, 1991), automatic control of the aircraft became gradually more sophisticated,
with devices becoming autonomous from continuous pilot command. Examples of such
devices are the yaw damper, which automatically initiates slight rudder movement to
prevent the “Dutch roll” phenomenon in swept wing aircraft, and “pitch trim
compensators” which control the tendency for aircraft to pitch down at near-supersonic
speeds. Although these devices and others like them increased safety and efficiency, and in
some cases, made high speed transport a reality, they also set the precedent that
autonomous automation could be introduced into the cockpit safely and successfully without
continuous pilot command or oversight.
As automation transitioned into its third generation (Billings, 1991), the objective of
integrating and managing the automatic systems to further reduce workload, increase
safety, and increase efficiency led to FMSs with phenomenal capability. Not only could an
entire flight be preprogrammed into the system, but this execution of the flight could be
undertaken without pilot intervention. Intrinsic to this automation capability was that the
system would have many “modes” available to command the flight. Just as pilots have
several methods to accomplish the same task in an aircraft (e.g., on approach, one can
control altitude with power changes, pitch changes, or approach with the throttles at idle,
and control altitude with pitch and wing spoilers), so too were multiple capabilities built
into FMSs, both for efficiency, and to provide the pilots with greater flexibility. With this
increased ability, however, came a certain need for the automation to deal with peculiar
situations without pilot intervention. This meant that upon reaching certain predetermined
target values or reaching certain “protection limits,” (i.e., the system senses that an unsafe
condition has occurred) the FMS can change its “mode” of operation or over-ride pilot
inputs.
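A toy sketch can make the protection-limit behavior concrete. The mode names, limit value,
and structure below are this author's invention for illustration, not actual FMS logic; the
point is only that the active mode can change without any new pilot input.

    from enum import Enum

    class Mode(Enum):
        FLIGHT_PATH_ANGLE = "FPA"
        VERTICAL_SPEED = "V/S"
        ALPHA_FLOOR = "ALPHA FLOOR"  # an envelope-protection mode

    def active_mode(commanded: Mode, angle_of_attack_deg: float,
                    aoa_limit_deg: float = 15.0) -> Mode:
        """Protection limits take precedence over the pilot-commanded mode."""
        if angle_of_attack_deg >= aoa_limit_deg:
            return Mode.ALPHA_FLOOR  # automation overrides the pilot's selection
        return commanded

    # A pilot who commanded V/S may find the aircraft in ALPHA FLOOR with no
    # new input of their own, the kind of uncommanded transition that makes
    # mode awareness difficult.
    print(active_mode(Mode.VERTICAL_SPEED, angle_of_attack_deg=16.0))

Such uncommanded transitions are precisely what the accident examples below turn on.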
Automation with this level of sophistication has led to two specific pilot interaction
problems (Sarter & Woods, 1991). The first problem is that, given the inherent complexity
of the system, greater demands are placed on the pilots to understand the multiple
ramifications of each FMS “mode.” Because a particular mode may behave differently
under different circumstances (e.g., at different altitudes), the pilot must understand in
advance what the FMS will do given certain inputs, and remember what each FMS mode
abbreviation means. Consider the following example:
An Airbus A320 flew into the ground while on short approach to Bangalore
Airport in India. The pilots had inadvertently set the auto-pilot to "Idle Open
Descent" mode, which sets the auto throttle to idle, rather than one of the two
descent modes in which auto throttles are active (Aviation Week, 1990). With the
throttles at idle, the airspeed decayed well below the desired airspeed of 132
knots, since altitude was maintained by
pitch rather than thrust. By the time the pilots realized their error, they were too
slow and too low to recover, and crashed short of the runway killing 94 of the 146
people on board.
The second FMS complexity problem is that the behavior of the automation is contingent
upon certain “situational” factors in addition to pilot inputs, often making it difficult for the
pilots to predict the behavior of the auto-pilot either upon engaging the auto-pilot, or in
monitoring its behavior as it progresses along the flight (Sarter & Woods, 1992). Consider
the following example:
An Airbus A300-600 stalled 1800 feet above the ground on approach to Nagoya
Airport, Japan, following a chaotic battle for control of the aircraft between the
pilots and the auto-pilot. While flying the aircraft manually with flight director
guidance and auto-throttles engaged, the co-pilot inadvertently engaged the TOGA
(take off/go around) lever on the throttle quadrant. Realizing the error, the pilots
attempted to continue the approach. To correct for the now off-glideslope condition,
they engaged auto-pilots 1 and 2, believing that the auto-pilot would return them to
the desired flight path.
Instead, the auto-pilot resumed the TOGA mode which had accidentally been
selected by the co-pilot previously. Realizing this, the pilots applied forward
pressure on the yoke to correct for the auto-pilot induced 18 degree nose up
condition. However, because the FMS software inhibits automatic "yoke force auto-
pilot disengagement" below 1500 ft, the auto-pilot remained engaged and initiated
nose-up stabilizer trim against the pilots' inputs. Although the pilots pushed down
with all their strength, the trim system continued to push the nose upward for
twenty seconds until the pilots manually disengaged the auto-pilot.
Several seconds later, the extreme nose up condition and deteriorating airspeed
unexpectedly caused the “alpha floor” protection mode to engage due to excessive
angle of attack. This “alpha floor” condition commanded a thrust increase inducing
an even greater nose-up attitude. Although the captain promptly disengaged the
“alpha floor,” the aircraft was far out of trim, the airspeed was at 78kts, and the
altitude 1800 ft. The aircraft stalled and could not be recovered before it hit the
ground, killing 264 people. (Aviation Week and Space Technology, 1996)
Although this incident seems obscure and hardly believable, very similar incidents also
occurred in 1985, 1989, and 1991 (Aviation Week, 1996), and highlight the dangers of
multi-modal automation. As automated systems become more sophisticated, they in fact
begin to fly more like humans (albeit more precisely), using a
complex combination of methods to achieve their goal. However, as this occurs, it makes it
more difficult for those pilots monitoring the automation to predict and, in fact, understand
what the automation is doing. Given the flexibility of the FMS and the “dynamism of flight
path control,” serious cognitive demands are placed on the pilots (Sarter & Woods, 1992).
Not only must they decide the level and mode of automatic control, but they must diligently
monitor the automation to ensure that it behaves as intended.
Sarter and Woods (1992, 1994), while seeking empirical evidence for pilots’ anecdotal
suggestions of confusion about the Flight Management System operation, found converging
and complementary data demonstrating both serious gaps in pilots’ understanding of the
system logic and difficulty in tracking the behavior of the FMS while in flight.
Surprisingly, Sarter and Woods (1992) found that 55% of Boeing 757 pilots surveyed
agreed with the statement “In B-757 automation, there are still things that happen that
surprise me." Further, 20% of the pilots agreed with the statement: "There are still modes
and features of the B-757 FMS that I don't understand."
In a follow-up study, Sarter and Woods (1994) created an FMS command-laden
experimental scenario which was then flown in a part-task simulator designed to teach FMS
operations. The goal of this study was to observe pilots using the FMS in a simulated
mission to understand pilots’ mental representations of the FMS logic. The results showed
that the majority of pilots had little difficulty with routine operations ranging from
establishing a holding pattern to setting up for an ILS approach. However, they found that
70% of pilots showed deficiencies in one or more of the following less standard procedures:
2. anticipating mode indications on the ADI display throughout the take-off roll,
7. describing the system behavior differences above and below 1500 ft for a loss of radio
“signal” condition.
An example deficiency was that 80% of pilots did not realize that aborting an auto-throttle
take-off required the pilot to manually disconnect (as opposed to an automatic disconnect)
the auto-throttles in order to prevent them from re-accelerating after manual intervention.
The authors (Sarter & Woods, 1994) attribute these deficiencies to two separate factors.
They see the first three deficiencies related to weak mode awareness, both in terms of
dealing with an FMS related failure and with anticipating system status and behavior. The
second factor, raised by the last four deficiencies, points to an impoverished knowledge of
the “functional structure” of the FMS (Sarter & Woods, 1994). It is quite obvious from
these findings (Sarter & Woods, 1992, 1994) and others (Wiener, 1989) that even
experienced pilots have trouble with the complexity of the FMS. Sarter and Woods (1992)
suggest that one of the primary problems with FMS systems is the poor feedback given to
pilots about the behavior of the FMS, exacerbating the already difficult task of predicting
system behavior. In fact, both accidents and empirical investigations have led to design
and training changes, yet findings persist in revealing that pilots have trouble
understanding and predicting FMS behavior, particularly in non-routine situations.
It has also been suggested that any system which is "multi-modal" in nature is difficult for
human operators (Norman, 1988; Wiener, 1989), and thus problematic regardless of
interface and training issues. Further, the problem remains that even if a crew has complete
understanding of the FMS system, 84% of FMS related reports to ASRS indicate that
"programming errors" still present the highest incidence of problems. The danger of such
errors is compounded by the fact that pilots have trouble predicting the behavior of the
FMS. Clearly, if some element of "wait and see" is built into pilots' monitoring strategies,
the detection of programming errors will be delayed. Consider the following example:
A Boeing 757 crashed into San Jose Mountain on approach to Cali, Colombia, while
descending at night; the crew never saw the mountain. A post-crash analysis of flight
data and cockpit recordings determined
that the pilots of the aircraft entered a command into the FMS to fly direct to Tulua
VOR in order to comply with an ATC request to report their position once over the
VOR. However, the pilots failed to realize that they had already passed Tulua VOR, so
their command caused the FMS to turn the aircraft back in the direction from which
they had come. While the behavior of the aircraft surprised the pilots, they continued
to let the FMS turn the aircraft in the wrong direction for approximately 90 seconds.
With suspicion growing, the pilots switched the auto-pilot to the “heading select” mode
in an attempt to return the aircraft heading toward Cali. However, the 90 second turn
to the left, and then the corrective turn to the right, placed the aircraft off course and in
mountainous terrain. The pilots initiated an emergency climb after prompting by the
GPWS, but the 757 hit the top of a 12,000 ft mountain,
killing 164 of the 167 individuals on board. (Aviation Week and Space Technology,
1996)
Only as evidence builds that the autopilot behavior is deviating from expectation will the
pilots begin to suspect a programming error. Although industry has advocated better
training and researchers have advocated better FMS interface and cockpit display design, it
seems likely that mode confusion will persist as long as FMS operation is optimized for
efficiency rather than for the pilot's understanding.
As suggested earlier, the clear relationship between high workload and increased
probability for human error has been a strong force in the push toward cockpit automation.
Early systems automation was successful in reducing the workload for pilots (Billings, 1991)
which was welcomed given that pilots “prefer to be relieved of much of the routine manual
control and mental computation in order to have time to supervise the flight more
effectively and to perform optimally in an emergency” (Wiener, 1988). Further, airlines
have long desired wide-body aircraft requiring only two crew members, and the reduction of
workload through automation made certification of such operations possible.
Defining workload and then measuring it has always been a difficult task for engineering
psychologists (Wiener, 1985), yet aircraft designers were quite confident that heightened
levels of automation would reduce workload to a large degree (Wiener, 1988). Two
interesting factors have arisen since highly automated aircraft were certified for two pilot
operation on the grounds that workload had been sufficiently reduced. First, it seems quite
evident from pilot studies that, while manual workload may have been effectively reduced,
mental workload was not reduced and may have actually increased. This is because the
nature of the pilot's task shifted from manual control to cognitive management of the
automation.
Wiener (1988) suggests that automation now calls for more programming, planning,
sequencing, and alternative selection, all of which add up to considerable levels of cognitive
processing. In fact, a study by Curry (1984) of 100 Boeing 767 pilots found that only 47%
agreed with the statement, “automation reduces overall workload.” Responding to another
question, 53% of the pilots agreed with the statement, “Automation does not reduce
workload, since there is more to monitor now.” Although subjective in nature, it is clear
that many pilots find the automation management task quite demanding and perhaps more
demanding than the highly manual flying task which the automation replaced (Kantowitz
& Casper, 1988). Not only does this bring into question the validity of the certification
findings, but it implies that if the relationship between workload and error still exists in the
automated cockpit, automation may now be affording new opportunities for human error
rather than eliminating them.
Another factor related to automation induced workload is the temporal spacing of workload
throughout the different phases of the flight. Automation has reduced the workload in
cruise phases of flight to almost nothing (Billings, 1991). However, workload tends to
increase dramatically upon entering the “terminal” area because of two factors. First, since
terminal area flight usually requires some combination of directional, altitude, and speed
changes, not to mention potential FMS mode changes, the monitoring task becomes much
more involved. The pilots must monitor the aircraft’s behavior in an attempt to stay ahead
of the FMS, and they must also monitor the FMS commands to ensure that the information
programmed into the FMS at the beginning of the flight is correct. The second factor in
increased terminal area workload is the commonly cited mismatch between Air Traffic
Control procedures and the FMS (Wiener, 1985). Assuming that ATC requires deviation
from a standard approach, which is often the case, the pilots must spend “heads down” time
reprogramming the FMS, while maintaining constant communication with ATC and
scanning for other aircraft. In the future ATC may communicate directly with an aircraft’s
FMS, thus reducing both communication and programming errors and allowing the pilots
greater opportunity for scanning for other aircraft, but this feature is still several years
away.
Automation-induced complacency has received growing research attention (Parasuraman,
Molloy, & Singh, 1993; Thackray & Touchstone, 1989). In addition, the term complacency
has been used to describe inadequate cockpit performance even prior to highly automated
aircraft. Complacency has been characterized as a psychological state marked "by a low
index of suspicion," while the ASRS coding manual defines complacency as "self-
satisfaction which may result in non-vigilance based on an unjustified assumption of
satisfactory system state" (Parasuraman et al., 1993). Singh, Molloy, and Parasuraman
developed a rating scale to measure complacency potential, conceptualized as
one's attitude toward automation coexistent with other factors. Singh et al. (1992) found
four independent factors revealing a potential for complacency, those being confidence,
reliance, trust, and safety.
Overconfidence in automation may not, however, be a strong enough factor itself to cause
complacency. Although Thackray and Touchstone (1989) attempted to induce the effects
of complacency by having a reliable automated Air Traffic Control task fail both in the
beginning and end of a two hour experimental session, they failed to show a reliable
performance difference between the two failures. Further, their research did not yield a
difference in detection efficiency between the group with automated assistance and the
group who performed the task without assistance. Thackray and Touchstone (1989)
reasoned that their failure to find a difference may have been due to the short session, or
perhaps because the subjects performed only a monitoring task, with no other tasks
competing for resources. Parasuraman et al. (1993) reasoned that the effects of automation-
induced complacency are more likely when the operator is responsible for many functions,
as is often the case in aircraft incidents in which complacency was a factor. Singh et al.
(1993) reasoned that complacent behavior exists only when both a complacency potential
exists on the part of the pilots, and certain other factors coexist. Those factors include pilot
inexperience, fatigue, high workload, and poor communication (Singh, Molloy, &
Parasuraman, 1993).
Based on the reasoning that high workload may cause automation induced complacency,
Parasuraman et al. (1993) had subjects detect failures of an automated system monitoring
device while those subjects controlled a fuel management system and a tracking task.
Automation reliability was manipulated with groups either seeing high or low constant
reliability or variable reliability automation alternating from high to low every ten minutes.
Additionally, subjects were placed in a “monitor only” group or were in a group which
monitored and controlled all tasks. Results clearly showed that detection of automation
failures was worse for subjects in the constant reliability condition. Results also showed
that subjects whose only task was to monitor showed no performance differences due to
automation reliability. This finding supported earlier findings (Thackray & Touchstone,
1989) that workload must reach a certain level before complacency-related performance
deficits will be seen. The authors viewed these results as the first evidence that automation
induced complacency could be produced by high workload and highly reliable automation.
These findings are significant in the operational setting because workload can be very high
at certain times and the automation extremely reliable. The problem however, is that the
automation is not perfectly reliable. As discussed earlier in this paper, pilots often enter
incorrect information into the FMS which then diligently carries out exactly what it is
commanded to do. In addition, the automation is capable of failure even when the correct
information is entered into the system. Consider the following example, described to this
author:
While in cruise over the Mediterranean en route from London to Cairo, the pilots
of a Boeing 767 monitored as the FMS flew the aircraft. Unbeknownst to the
pilots, the auto-throttle was gradually but erroneously reducing the thrust being
supplied from the engines. While this was occurring, the auto-pilot was gradually
raising the aircraft's nose to maintain the altitude specified by the FMS. Because of
the moderate rate of thrust
reduction and smoothness with which the auto-pilot responded, the pilots failed to
sense the cues normally associated with changes in pitch. Fortunately, the captain
eventually noticed that the airspeed was unusually low, and manually accelerated
the throttles. However, by the time the anomaly was noticed, the airspeed had
dropped 25kts below the appropriate cruise speed, and only 15kts above stall
speed.
This example demonstrates that even highly reliable automation can and does fail in
operational settings. Not only does automation simply fail to perform correctly, but such
failures can go undetected by complacent monitors.
Situational Awareness
As research on the "out of the loop" performance problem has matured (Endsley,
1995), new terminology, research methods, and constructs have evolved to suit this research
area. Of these, the concept of “situational awareness” has evolved as a means of both
conceptualizing the problem, and, in fact, measuring it. The use of situational awareness as
a causal agent is strongly supported by some (Endsley, 1995) or used only as a label for a
variety of cognitive processing activities by others (Sarter & Woods, 1995). It is viewed by
some as a "buzzword of the '90s," rather than an effective research paradigm (Wiener,
1993), and its utility as a construct has been questioned (Flach, 1995). Because of the effort
dedicated to this research paradigm in both civilian
and military settings, situational awareness will be treated as a valid construct for the
purposes of this paper, and the discussion will focus on how it has been used to
conceptualize and measure the "out of the loop" performance problem.
Although there have been numerous definitions proposed for situational awareness, most
have not been applicable across different task domains (Endsley, 1988). However, the
definition settled on by the most prolific researcher in this area is as follows (Endsley,
1995): “Situational awareness is the perception of the elements in the environment within a
volume of time and space, the comprehension of their meaning, and the projection of their
status in the near future." Further, Endsley (1995) has divided situational awareness into
three levels. Level 1 situational awareness is the perception of the task-specific
elements in the environment. Those task-specific elements include the status, attributes,
and dynamics of the environment which are specifically pertinent to effective performance.
Level 2 situational awareness is the comprehension of the situation based on the synthesis
of disjointed Level 1 elements. Most important, however, is the fact that this level of
awareness involves interpreting the Level 1 elements in light of the operator's goals,
providing a holistic picture of the environment to the operator.
Level 3 situational awareness is the ability of the operator to project the future actions of the
elements in the environment based on Level 2 situational awareness. The highest level of
situational awareness “is achieved through knowledge of the status and dynamics of the
elements and comprehension of the situation” (Endsley, 1995).
Although relatively little research has been conducted using situational awareness as the
dependent variable, Endsley and Kiris (1995) used an expert-system-aided navigation task
to study the effects of differing levels of automation on workload and situational awareness.
Using five levels of automation, the authors hypothesized that both workload and
situational awareness would decline as the level of automation increased. Measuring (a)
decision time upon automation (expert system) failure, (b) decision selection, (c) decision
confidence, (d) workload, and (e) situational awareness, the authors found that as the
automation level went up, decision time following an automation failure also went up.
Further, situational awareness went down as the level of automation went up; in particular,
the comprehension component of
awareness was affected by automation, leading the authors to speculate that subjects who
relied on automation may not have developed a higher level of understanding of the
situation. Workload, however, did not decline reliably with automation level, consistent with
research and anecdotal findings (Billings, 1991) that automation does not necessarily
correlate with reduced workload. Surprisingly, higher confidence levels corresponded with
higher levels of automation, even though their decision times were longer and situational
awareness lower.
Whether or not one supports the use of situational awareness as a theoretical construct, or
as merely a general descriptive concept, there is little doubt that the research that has been
done has successfully captured some difference in operator knowledge based on automation
level. Further, in terms of conceptualizing and communicating the nature of the “out of the
loop” performance problem, research in this area has been beneficial. Most importantly,
however, if one looks at this research as part of a body of research which has attempted to
measure operator performance in terms of level of automation, the findings are generally
consistent with other research in the field demonstrating reduced operator efficiency when
placed “out of the loop” (Johannsen, Pfendler, & Stein, 1976; Kessel & Wickens, 1982;
Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969).
Mental Models
The concept of the “mental model” as an explanatory device for human cognition is not a
new one, nor is interest in mental models exclusive to cognitive and engineering
psychology (Wilson & Rutherford, 1989). In fact, mental models have been used as an
explanatory construct in manual control literature for over thirty years (Rouse & Morris,
1986). This body of literature commonly used the phrase “internal model” to describe the
“images” that individuals use to organize and execute daily procedural activities or to
operate complex devices (Jagacinski & Miller, 1978). While the originator of the mental
model notion is likely Kenneth Craik (1943), Johnson-Laird (1983) instantiated and
popularized the notion of the mental model (and in fact a more sophisticated and formal
version of it), leading to the
embrace of this concept by cognitive psychology in the early eighties (Rouse & Morris,
1986). Interestingly, however, while the manual control literature viewed this concept
as generally self evident (Rouse & Morris, 1986), and therefore a suitable assumption
requiring little direct examination, cognitive psychology focused
more directly on the “mental model” as a phenomenon (Rouse & Morris, 1986), even
attempting to specify its underlying cognitive processes (Wilson & Rutherford, 1989).
Norman (1988) explained people’s interactions with devices by distinguishing the
conceptual model,
characterized as the appropriate model which a system designer desires the operator to
have, versus the mental model, which is what the operator actually develops through device
interaction.
33
Even though the use of the concept of a mental model is fairly common in the literature, it
has suffered from a lack of explicit definition (Rouse & Morris, 1986). Johnson-Laird
(1981) stated: “A [mental] model represents a state of affairs and accordingly its structure
plays a direct representational or analogical role. Its structure mirrors the relevant
aspects of the corresponding state of affairs in the world.” Rouse and Morris (1985) have
defined a mental model as: “mechanisms whereby humans are able to generate descriptions
of system purpose and form, explanations of system functioning and observed system states,
and prediction of future states.” Carroll and Olson (1987) have defined mental models as
“a rich and elaborate structure, reflecting the user’s understanding of what the system
contains, how it works, and why it works that way. It can be conceived as knowledge about
the system sufficient to permit the user to mentally try out actions before choosing one to
execute.” Borgman (1986) summarizes the perspective of research in the human-computer
interaction domain, describing the mental model as a
cognitive mechanism for representing and making inferences about a system or problem
which the user builds as he or she interacts with and learns about the system. The mental
model represents the structure and internal relationship of the system and aids the user in
understanding it, making inferences about it, and predicting the system’s behavior in future
states. Common to these accounts is the view that a well developed mental model provides:
(a) knowledge of relevant system
elements that directs attention and classifies information in the perceptual process, and (b) a
mechanism for projecting future states of the system based on its current state.
Regardless of the specific author, however, most definitions contend that a mental model is
a form of subjective representation of external reality, and allows implicit use of the model
for the purpose of “thinking” about the system. This renders mental models functional,
affording the user some explicit, though limited, ability to consciously run the model.
Equally important, however, is the notion that a user’s mental model is seldom a perfect
analogy to the real system, and is “surprisingly meager, imprecisely specified, and full of
inconsistencies, gaps, and idiosyncratic quirks,” and quite often possesses blatant errors.

The purpose of this discussion, however, is not to review the relative merits and theories of
mental models, but rather to discuss the way in which the general conceptualization of
mental models is useful in understanding the “out of the loop” performance problem in
highly automated aircraft. Aviation presents itself as a unique domain for the study of
mental models for two primary reasons. First, nearly all of its operators, especially those
in the commercial domain, can generally be considered domain experts.
Further, not only are its participants highly trained and versed in aviation related concepts,
but all must perform a nearly identical task. This is not to say that all pilots have identical
mental models, or that their models are a perfectly balanced representation of the real
system. However, as a population of experts they most certainly have very similar models
of the system, and their models are by necessity fairly accurate representations.
The second factor that makes aviation unique is the high level of complexity which must be
part of the flight task mental model. Not only must the pilot’s model include the traditional
manual controlling model in order to fly the aircraft, but the pilot must also have the
aircraft systems, airspace system, air traffic control system, communication, navigation, and
most importantly, the current dynamic state of the aircraft in relation to all the other
systems as part of that model. This notion is not unlike the perspective held by Williams,
Hollan, and Stevens (1983) that mental models are composed of autonomous objects with
an associated topology; an autonomous object being a mental object with an explicit
behavior of its own and explicit topological connections to other objects. In addition, I
propose that there must be two
levels of the same model: a static, schema-like model of the system, and a real-time,
dynamic execution of that model.
The static model is much like any operator’s model of a particular device, allowing the pilot
to clearly describe the operations of all the systems and the relationships between those
systems. When flying, however, the static model is the basis for the activation of the
dynamic execution. The dynamic execution is, in essence, the activation of the static model
with variable data entered into hypothetical “slots.” The activation of this model, however,
is not uniform, but rather a system with varying levels of activation in which components of
the model that are required for efficient task completion are most activated, and those less
relevant to the task less activated. As the operator
accomplishes the task, those areas of the static model which have become activated remain
that way for some time even when the task no longer supports the activation of the model.
The areas of activation provide the pilot with quick and easy access to those areas, and
benefit the pilot through more efficient cognitive and perceptual processing of features
relevant to the task.

For example, a pilot, while on the ground, can explain the relationship between pitch,
power, altitude and airspeed. While flying, and especially while initiating a descent, the
pilot must use the information in this model, in combination with elements of the present
dynamic environment (i.e., current airspeed, throttle setting, pitch and altitude) in order to
execute the descent properly. I contend that during the execution of a task, in this case the
execution of a descent, the relevant portions of the static mental model and all their
associated elements become activated. Not only does activation allow for proper execution
of the task but, according to Endsley’s (1995) description of a well developed mental model,
“the model will provide (a) for the dynamic direction of attention to critical cues, (b)
expectations regarding future states of the environment (including what to expect as well as
what not to expect) based on the projection mechanisms of the model, and (c) a direct,
single-step link between recognized situation classifications and typical actions.”
The unique feature of the commercial pilot, however, is that he has a well developed mental
model of the flying task, yet there are frequent examples (this paper and Billings, 1991) of
pilots failing to perceive and integrate information as would be expected given the quality
of their mental model (and its supposed level of activation). Most importantly, Endsley
(1995) points out that a mental model should provide for the dynamic direction of attention
to critical cues. Yet it often seems that pilots fail to attend to critical and sometimes life
threatening cues which should be perfectly salient. This dynamic execution theory of
mental models predicts that any weakening of activation would hinder the operator’s ability
to perceive critical elements in the environment, and would thus lead to conditions in which
critical cues are not perceived and integrated into useful information.
The proposed theory is somewhat similar in vein and at least tangentially related to
Neisser’s perceptual cycle (Neisser, 1976) which, as put forth by Adams, Tenney, and Pew
(1995), views perceptual acuity and efficiency as a function of cognitive structures available at the
time of perception. Neisser states, “Because we can see only what we know how to look for,
it is these schemata (together with the information actually available) that determine what
will be perceived. . . . At each moment the perceiver is constructing anticipations of certain
kinds of information, that enable him to accept it as it becomes available” (p. 20). The cyclic nature
of Neisser’s theory implies that each perceptual event results in a modification of the
schema which then “directs further exploration and becomes ready for more information.”
Neisser’s theory suggests that effective perceptual activity is contingent upon the quality
and nature of the previous perceptual cycle. If the activity undertaken by an individual is
different from the activity suggested by the operator’s primary goal, then the perceptual
cycle which proceeds may be ineffective for guiding perceptual activity. According to the
dynamic execution theory, a mental model which remains fairly static (e.g., when the
operator inactively monitors for long durations) will likely lead to a perceptual system
unprepared for the consumption of critical information, or perhaps prepared for the wrong
information. Neisser’s perceptual cycle (1976) also suggests that as the operator’s task
shifts there should be a transitory period during which an inadequate perceptual cycle must
be modified before perception again becomes effective.
The next section will review previous research from controlled empirical studies which
examined human monitoring behavior in manual and automated systems, some of which
allude directly to the notion of mental models or similar concepts. In fact, Endsley and
Kiris (1994) suggest that some forms of manual control may lead to “maintenance” of an
operator’s mental model. While certainly related, such suggestions are problematic given
the quality of the mental model possessed by experienced pilots. Further, although there is
an implication in some studies that manual control improves cue sensitivity (Johannsen,
1976; Kessel & Wickens, 1982; Wickens & Kessel, 1979-1981; G. Young, 1995; Young,
1969), the mechanism underlying this improvement has not been specified.

I hope not only to shed some new light on the results of past research using the dynamic
execution theory, but also to present new experimental results which further support the
hypothesis that effective sensitivity to key elements in a dynamic process
environment, and correct integration of and response to those elements, is contingent upon
an adequately activated mental model of the system.
Relevant Research
Given the proliferation of automation in modern cockpits, and the anecdotal and theoretical
support for the view that automation in cockpits should be approached cautiously (Billings,
1991), there is surprisingly little controlled, empirical research dealing with this issue.
Most of the research comparing monitors and controllers in automated, dynamic tasks has
employed tracking or flight control tasks with simulated flight dynamic shifts implicating
control system failure (Johannsen et al., 1976; Kessel & Wickens, 1982; Wickens & Kessel,
1979-1981; G. Young, 1995; Young, 1969) or actual flight tasks with a failure of the
automated system (Ephrath & Curry, 1977). Others have used cognitively oriented decision
making tasks (Endsley, 1995; Parasuraman, Molloy, & Singh, 1993; Thackray &
Touchstone, 1989). Findings from these experiments have generally demonstrated superior
failure detection performance for active controllers
(Endsley, 1995; Parasuraman et al., 1993; Johannsen, Pfendler, & Stein, 1976; Kessel &
Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969) although some
have found either no difference (Thackray & Touchstone, 1989) or a failure detection
advantage for monitors (Ephrath & Curry, 1977). These problematic research findings have
been attributed to a task which was unnecessarily biased for the system monitors (Young,
1995), having an experimental paradigm in which workload was too low (Parasuraman,
Molloy, & Singh, 1993) or experimental trials that were too short in duration (Thackray &
Touchstone, 1989) to reveal performance differences between controllers and
monitors.
The methodological approach used in the present research is based on studies which found
superior failure detection performance for manual controllers on a tracking task (Kessel &
Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969). More
importantly, those subjects who controlled manually were also better at detecting failures
when they then transferred to the monitoring task. The transfer effects found in this
research offer the strongest evidence that differences between monitors and past-controllers
may be related to differences in subjects’ mental models of dynamic systems. The next
section will therefore focus primarily on related experiments in which monitors and
controllers were compared in a transfer condition. These findings are both theoretically
and operationally more significant and are the basis for the current research.
Paradigm History
Young’s (1969) single-axis tracking task, which found superior performance for controllers,
was improved and expanded by Wickens and Kessel (1979) who designed a similar
experiment using a two dimensional pursuit tracking task to increase task complexity.
They also addressed concerns that Young’s (1969) auto-pilot methods may have been
unrealistic, making the detection of failures easier for monitors. Their results demonstrated
that as subjects switched from monitoring to controlling, detection
accuracy increased slightly. Wickens and Kessel (1979) determined that the superior
performance of controllers depended on proprioceptive cues available
in the first few seconds after the onset of a failure. Because of the short, transient nature of
a proprioceptive standard, detection must occur within the first seconds, otherwise subjects
resorted to the visual channel for failure information. This finding further strengthened
their argument that the controller advantage was a result of proprioceptive feedback.
The authors also suggested that superior performance may have been due, in part, to a more
consistent conceptual model of the system; a more consistent internal model of the
system should enhance the subject's ability to detect deviations from a normal state. This
was based on the view that a conceptual model of greater consistency developed as a result
of the controller's ability to differentiate between one’s own inputs and those acting upon
the system externally (e.g., turbulence). In addition to having an internal model of greater
consistency, it was believed that controllers in a system could test hypotheses about the
general state of the dynamic system through subtle system inputs, reinforcing and testing
that model.

In order to determine the role of workload in failure detection performance, Wickens and
Kessel (1979) employed a secondary task in their experiment. As the side task was added
and its difficulty increased, no marked decrease occurred in the detection performance of
either group as a function of
workload. Instead, higher levels of workload shifted the speed-accuracy bias toward speed
at the expense of accuracy.
Wickens and Kessel’s (1979) finding raised the question of why the increased workload of
the manual tracking task did not have a negative impact on failure detection performance
like that found by Ephrath and Curry (1977). Wickens and Kessel (1980) pointed out that
the manual tracking task and the failure detection task may not be competing for the same
resources, as had been previously believed. Moreover, the resources
allocated to the tracking task were those that allowed subjects to utilize proprioceptive
feedback in the detection process. This suggested that these operations work in
cooperation, rather than in competition, with each other. Using the same experimental
paradigm to examine resource allocation, Wickens and Kessel (1980) concluded that
controlling and monitoring actually
rely on different processing resources to detect failures. Failure detection while monitoring
draws on the same perceptual resources used to perform the monitoring
task. However, while controlling, subjects rely on a response-related reservoir separate from
those perceptual resources.

The previously mentioned studies (Wickens & Kessel, 1979; Young, 1969) employed repeated measures
designs that had subjects perform both monitoring and controlling versions of the failure
detection task. If subjects developed a superior conceptual
representation while controlling, this advantage would have been available to them in
both participatory modes.

Wickens and Kessel (1979) hypothesized that concurrent development of both a controlling
and a monitoring conceptual model negatively affected the performance of controllers. This
was based on evidence suggesting that visual information caused a reduced sensitivity to
proprioceptive information, especially when the two sources contradicted each other
(Posner, Nissen, & Klein, 1979). Therefore, because of a strictly visual-cue based model
developed while monitoring, subjects may have had the tendency to rely on faulty visual
cues while controlling. This bias toward visual cues when the two information sources were
in conflict therefore negatively affected the performance of controllers. Of course, such
cross-contamination between participatory modes could not be ruled out within a repeated
measures design.
Kessel and Wickens (1982) isolated the impact of subjects' conceptual representations on
failure detection performance by employing a between-subjects, transfer of training
design. In this study, three groups of subjects were used: the first group transferred from
controlling to monitoring, the second group transferred from monitoring to controlling, and
the third group monitored in both sessions. Consistent with expectations, monitors took
longer to respond to system failures and made more errors than controllers. Further, the
magnitude of this difference was approximately five times that found in the previous
repeated measures designs, thus
confirming the view that the monitor/controller conceptual model bias had been operating
in those designs. Most important,
however, was the significant increase in the performance of subjects during monitoring who
had controlled during the first session (Kessel & Wickens, 1982). This result indicated that
controlling not only led to the development of a conceptual model that aided in detecting
failures, but that the model was powerful enough to affect performance on a task which no
longer supported the features of that particular conceptual model. From the standpoint of
dynamic process control, this finding suggests that many of the benefits of automation can
be utilized while allowing the operator, through proper training, to maintain a conceptual
model optimal for detection of subtle changes in system performance (Young, 1995).
Kessel and Wickens’ (1982) transfer of training design was replicated by Young (1995),
who improved on the design by implementing a yoking procedure that insured identical
visual stimuli for both controllers and monitors, thus eliminating auto-pilot induced biases.
Further, Young (1995) addressed concerns that Kessel and Wickens’ (1982) transfer effects
may have been attributable to simple vigilance factors and not conceptual model differences
by creating a condition with a high rate of failures and a very short trial length (80 failures
in just over six minutes). If the earlier studies’ results represented merely vigilance-related
effects, then a very short experiment with a high rate of failures would not be expected to
reproduce them.
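Although neither study describes its software, the yoking logic itself is straightforward. The
sketch below (in Python, with hypothetical structures and function names) records every
display frame generated during a controller’s trial and replays the identical stream to a
yoked monitor:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Frame:
        """One display update: everything a subject sees on a given tick."""
        time_ms: int
        cursor_pos: float   # system output drawn on the screen
        target_pos: float   # command input drawn on the screen

    def record_controller_trial(sim_frames) -> List[Frame]:
        """Store every frame produced while a controller performs the trial."""
        return [Frame(t, cursor, target) for (t, cursor, target) in sim_frames]

    def replay_for_yoked_monitor(tape: List[Frame],
                                 draw: Callable[[Frame], None]) -> None:
        """Present the identical visual stream to a yoked monitor."""
        for frame in tape:
            draw(frame)  # the monitor sees exactly what the controller saw

The essential design property is that the monitor’s visual input is a pure replay of the
recording, so any detection difference must originate in the observer rather than in the
stimulus.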
Young (1995) successfully replicated Kessel and Wickens’ (1982) results, showing that
when active controllers are transferred to the monitoring task they are better at detecting
failures than subjects who only monitored. This was additional evidence that features of the
controlling task transfer to the monitoring condition, and Young’s (1995) yoking
methodology insured that both controllers and monitors, when compared directly, received
identical visual stimuli. Young (1995) also found a nearly identical pattern of results when
the experiment was reduced in length from 45 minutes to just over six minutes. This
finding further supported the hypothesis that the improved failure detection performance
was due to an improved conceptual model guiding focus to relevant visual cues.
Present Research
Taken together, the results of Kessel and Wickens (1982) and Young (1995) strongly
suggest that individuals who control a simple dynamic system have an advantage in
detecting failures of that system when monitoring compared to individuals who only
monitor. Further, this research suggests that controllers develop a conceptual model of the
system which makes them more sensitive to subtle cues implicating system failure.
Although these findings are significant, they are limited in scope given the largely psycho-
motor nature of the tracking task employed. Although controllers may in fact have a more
effective “conceptual model” of the system, this model bears little resemblance to the
complex mental models required in operational settings.
Although the pilot of an aircraft, for example, may have a motor schema for manual control
of the aircraft, this is but one component of a mental model of far greater complexity. An
operator of a two dimensional tracking task has essentially one display to guide his control,
yet the aircraft pilot has multiple displays to track, not to mention six degrees of freedom
rather than two, and out-of-cockpit, tactile, and aural information to guide his control.

The first objective of the present research is therefore to
replicate the findings of Kessel and Wickens (1982) using a more complex, non-psycho-
motor, aviation-like dynamic task. This experiment not only seeks to replicate the original
finding that controllers show better monitoring performance, but also to validate this
paradigm as an improved experimental platform for exploring the idea that a better
conceptual model of the system underlies the controller advantage.

The primary objective in the design of the experimental paradigm was to create a task
in which failures must be detected inferentially. In such
tasks, the monitor collects data from the display, each sample being regarded as a
piece of evidence about the state of the system. As a
monitor of this system, the operator entertains a “rolling null hypothesis” that system
parameters have not changed, but responds when some change in the parameters has been
detected.
Although the particular task is generated from aviation type components, the combination
of these tasks is synthetic, and simple enough so that an individual can acquire the basic
principles and operational requirements in a half-hour of training. The task is, however,
highly analogous to many forms of dynamic process control where a failure of some sort is
not reflected in a single value, but rather in an apparent shift in the population mean
(Wiener, 1984) and thus inferential in nature. The task was designed so that failure
detection requires a synthesis of several features of the task, making detection from a
single cue alone impossible.
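The “rolling null hypothesis” can be made concrete with a simple one-sided accumulator in
the spirit of CUSUM change detection. The sketch below is my own illustration, not the
mechanism implemented in the experiment; the slack and criterion parameters are
arbitrary:

    def rolling_null_monitor(samples, expected_mean, slack=0.5, criterion=5.0):
        """Flag an apparent shift in the sampled population mean.

        Each observation is evidence against the null hypothesis that
        system parameters have not changed; deviations within `slack`
        are forgiven, larger ones accumulate until `criterion` is crossed.
        """
        evidence = 0.0
        for i, x in enumerate(samples):
            evidence = max(0.0, evidence + abs(x - expected_mean) - slack)
            if evidence > criterion:
                return i  # sample index at which a failure is declared
        return None  # null hypothesis retained for the whole trial

    # Example: the mean shifts from 10 to about 12 halfway through.
    readings = [10.1, 9.8, 10.2, 10.0, 12.1, 12.3, 11.9, 12.2, 12.0]
    print(rolling_null_monitor(readings, expected_mean=10.0))  # -> 7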
Based on the view that aircraft pilots have a reasonably complex mental model of the flying
task and that numerous subtleties are built into this model (e.g., the sensory stimuli one has
while initiating a descent), every effort was made to include operational subtleties as part of
the system. These subtleties would, at least in theory, become part of the operator’s mental
model. Further, a mastery of these subtleties would enhance one’s ability to infer a failure
since system subtleties initially have the effect of masking actual system behavior, an effect
which weakens and eventually reverses as proficiency with the system increases.
Creating a paradigm that requires inferential monitoring for effective failure detection
would provide evidence that a more effective mental model can assist in the detection
process. However, such a finding would not necessarily exclude a general vigilance
explanation. Therefore, a second failure detection task was added which would represent
the more traditional signal/no signal vigilance task. This failure type was represented by a
bold red indicator surrounding a fuel pump and is analogous to a sub-system indicator light
illuminating in a cockpit. Because indications of the failure are explicit and unpredictable,
detection of this failure type should depend on simple vigilance rather than on any model of
system behavior.

In addition to the two failure types, this experiment employed two different auto-pilot types.
The first type of auto-pilot was the “yoked” type as used originally by Young (1969) and
later Young (1995), in which monitors’ visual stimuli consisted of recorded representations
of controllers’ trials. This method has the advantage of insuring that the
visual stimuli received by both controllers and monitors are identical in all conditions.
However, it has the disadvantage of providing visual stimuli which, in terms of auto-pilot
like behavior, is unrealistic. Thus, effects found could be criticized in terms of validity,
since monitors of real dynamic process control systems typically see the system operated in
an optimally efficient manner. For this reason, a third condition was added that used an
“optimized” auto-pilot which operated the system in a
highly efficient manner so that fuel levels were always within the “safe” areas, and the
throttle always matched the recommended setting.

This experiment represents the first use of this paradigm to test past
findings that past controllers make efficient monitors (Johannsen et al., 1976; Kessel &
Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969). This research
is thus somewhat exploratory in nature. However, it is expected that the findings will again
show that controllers, when compared directly with monitors, are superior at detecting
failures. It is possible, however, that the higher workload resulting from the controlling
task will negate some of the typical benefits of controlling (e.g., hypothesis testing and
heightened mental model activation).
More importantly, however, it is expected that controllers will be more efficient monitors
when compared to individuals who monitor in both conditions. Further, these differences
should appear only in the inferential monitoring task, and not in the simple explicit
detection task. This expectation is a result of the hypothesis that the improved controller
performance observed in the past is due to an activated mental model guiding subjects to
subtle system cues. It is also hypothesized that any differences between controllers and
monitors will be seen in both yoked and optimized auto-pilot conditions, since both auto-
pilot types have in the past shown differences between monitors and controllers.
EXPERIMENT 1
Method
Subjects
Thirty-eight right-handed male university students were used in the experiment. Students
were paid a base rate for their participation in the experiment. Additionally, subjects were
given the opportunity to earn a five dollar bonus for good performance. All subjects had
normal or corrected-to-normal vision.
Apparatus
A 50 MHz Intel 486 PC with a 17 inch color CRT display was used. A spring centered,
dual-axis hand control (CH Products FlightStick) with a finger operated trigger was
connected to the PC via a 12-bit A/D converter. The subjects sat in a cushioned, semi-
reclining chair, with a rest supporting their arm and the “joy stick.” The seating position
yielded an eye-to-display distance of approximately 100 cm. The room containing the
apparatus was darkened, with primary light being provided by a red bulb for the purpose of
minimizing glare on the display.
Task
A discrete, single-dimension tracking task was used in combination with a fuel
management task in the aviation-based simulation (see Appendix A). The display
contained a “pictorial” representation of an aircraft fuel system with tanks in each wing,
two in the front, and two in the rear of the aircraft. Fuel tanks were interconnected with a
series of symbolic fuel lines showing fuel flow direction, and boxes on the fuel lines
represented pumps which were either on or off. The fuel management portion of the task is
similar to the Multi-Attribute Task Battery (Comstock & Arnegard, 1992) fuel management
task used by Parasuraman et al. (1993, 1996). The throttle level and recommended throttle
setting which made up the discrete tracking task were located in the right portion of the
display and the aircraft’s speed was displayed digitally in the nose of the aircraft.
The single-dimension, discrete tracking task required subjects to use the joy stick in order to
match the aircraft’s current throttle level with the “recommended throttle setting” level.
The current throttle setting was indicated by a yellow bar, while the “recommended throttle
setting” was indicated by an adjacent blue bar. Throttle position directly controlled the
displayed speed of the aircraft, which was explicitly displayed in the nose of the aircraft but,
more importantly, throttle position controlled the amount of fuel consumed by the aircraft.
The relationship between throttle position and speed was linear, but the relationship
between speed and fuel consumption was non-linear. Therefore, higher throttle positions
consumed disproportionately more fuel. This non-linear
speed/fuel consumption relationship meant that doubling the speed, for example, resulted in
more than twice the rate of fuel consumption.
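To illustrate the two relationships (the actual functions used in the simulation are not
reported, so the gain, coefficient, and quadratic form below are assumptions):

    def speed_from_throttle(throttle: float, gain: float = 2.0) -> float:
        """Displayed speed is a linear function of throttle position."""
        return gain * throttle

    def fuel_flow(speed: float, coeff: float = 0.01) -> float:
        """Fuel consumption grows non-linearly (here, quadratically) with speed."""
        return coeff * speed ** 2

    # Doubling the throttle doubles speed but quadruples fuel consumption:
    print(fuel_flow(speed_from_throttle(50)))   # 100.0
    print(fuel_flow(speed_from_throttle(100)))  # 400.0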
The fuel management task involved the on/off manipulation of six fuel pumps which
controlled fuel flow between fuel tanks. Subjects manipulated the fuel pumps by toggling
keys on the keyboard which were both mapped to the general layout of the fuel pumps, and
were labeled with a specific fuel pump number. The fuel management task required
subjects to manipulate the fuel transfer pumps in order to keep fuel levels in the four main
tanks at “safe” levels, indicated by yellow bars on the fuel tanks. Subjects were told that
their task was to pump fuel out of the wing tanks and into the front and rear tanks so that
those tanks remained at safe levels.
The task was made more difficult by three subtle features of the system. First, as mentioned
earlier, although fuel depletion from the rear tanks was controlled by the speed of the
aircraft, the relation between aircraft speed and fuel consumption was non-linear, so that
subjects had to pay close attention to the throttle level in order to predict fuel consumption.
Second, the fuel tanks, although pictorially similar in size, had different fuel capacities, so
that a single pump operation had a different effect on the displayed fuel level in each of
the two tanks it connected. Third, the fuel pumps had different flow rates, such that a
pump’s flow rate was contingent upon the location of that pump in the fuel system.
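These subtleties amount to a small configuration table. A hypothetical sketch (all
capacities, rates, and routings are invented for illustration, loosely following the training
examples described below):

    # Pictorially identical tanks with unequal capacities (fuel units):
    TANK_CAPACITY = {
        "left_wing": 3000, "right_wing": 3000,
        "front_1": 1500, "front_2": 1500,
        "rear_1": 2000, "rear_2": 2000,
    }
    # Pump flow rates (units/s) depend on a pump's place in the system:
    PUMP_RATE = {
        "P1": 20, "P2": 20, "P3": 10, "P4": 10, "P5": 15, "P6": 15,
    }
    # Routing: (source tank, destination tank) for each pump:
    PUMP_ROUTE = {
        "P1": ("left_wing", "front_1"),
        "P3": ("left_wing", "rear_1"),
        # ... remaining pumps wired analogously
    }

    print(PUMP_RATE["P1"] / PUMP_RATE["P3"])  # 2.0 — P1 pumps twice as fast as P3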
Two types of failures occurred in the system, each representing a different type of fuel
system failure (see Appendix A). The first type of failure, the signaled pump failure, was
indicated by the symbolic pump border changing from thin white to a highly salient thick
red. Subjects had five seconds to detect this failure. If the failure was detected, or time
expired, the red border returned to white. Subjects were told that a pump failure indicated
a problem with a pump, but that pressing the trigger returned the pump to normal
functioning. A pump failure was totally unrelated to the pump or fuel tank behavior, and
was thus only detectable by the change in the fuel pump border.
The second failure type was the inferential failure, called a “pressurization” failure, and
was indicated by abnormal behavior of fuel levels within the four fuel tanks. A
pressurization failure occurred when the fuel level in one of the four main tanks increased
or decreased in a manner inconsistent with what would be expected given: a) fuel pump
activity and b) rate of aircraft’s fuel consumption. This task was made more difficult by the
subtle system features mentioned previously. Subjects had 16 seconds to detect a
pressurization failure. If the failure was detected, the abnormal fuel flow behavior stopped.
If the failure went undetected during the 16 second failure duration window, the abnormal
fuel flow ceased and the tank level remained at its new level.
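In outline, then, inferring a pressurization failure amounts to comparing a tank’s observed
level change against the change implied by pump activity and fuel consumption. A sketch
of this comparison (my own formulation, with a hypothetical detection tolerance):

    def expected_level_change(inflow_rates, outflow_rates, burn_rate, dt_s):
        """Tank-level change implied by pump settings and fuel consumption."""
        return (sum(inflow_rates) - sum(outflow_rates) - burn_rate) * dt_s

    def is_pressurization_failure(observed_change, inflow_rates, outflow_rates,
                                  burn_rate, dt_s, tolerance=5.0):
        """Infer a failure when a tank level moves in a way that pump
        activity and the aircraft's fuel consumption cannot account for."""
        expected = expected_level_change(inflow_rates, outflow_rates,
                                         burn_rate, dt_s)
        return abs(observed_change - expected) > tolerance

    # A tank drains far faster than its pumps and the burn rate can explain:
    print(is_pressurization_failure(-90.0, inflow_rates=[10.0],
                                    outflow_rates=[0.0], burn_rate=2.0,
                                    dt_s=5.0))  # True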
Experimental Design
Three groups participated in the transfer of training, between-subjects design. The first
group controlled the first day of the experiment, the second group monitored in the
“optimized” auto-pilot condition, and the third group monitored in the “yoked” auto-pilot
condition. On the second day (transfer day), all three groups monitored in both the auto-pilot and
yoked conditions in four 14 minute trials with two counterbalanced trials of each
monitoring condition.
The experimental portion of each day consisted of four 14 minute trials with a two minute
break between each trial. Each 14 minute trial had seven pump failures and seven
pressurization failures. Failure type and failure sequence were randomized, and the time
between failures was between 20 seconds and three minutes. (See Figure 1).
Figure 1. Participatory Mode, Experiment 1.

Group                     Session 1 (Day 1)    Session 2 (Day 2, monitoring)
Controllers               Control              Auto-pilot; Yoked
"Auto-pilot" monitors     Auto-pilot           Auto-pilot; Yoked
"Yoked" monitors          Yoked                Auto-pilot; Yoked
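The failure schedule described above (seven failures of each type per 14-minute trial,
separated by 20 seconds to three minutes) could be generated roughly as follows; the
rejection-sampling approach is an assumption, since the actual randomization procedure is
not reported:

    import random

    def make_trial_schedule(trial_s=14 * 60, n_pump=7, n_press=7,
                            min_gap_s=20, max_gap_s=180):
        """Randomize failure type order and spacing within one trial,
        keeping every inter-failure interval between 20 s and 3 min."""
        failures = ["pump"] * n_pump + ["pressurization"] * n_press
        random.shuffle(failures)
        while True:  # crude rejection sampling; re-draw until spacing is legal
            onsets = sorted(random.uniform(min_gap_s, trial_s)
                            for _ in failures)
            gaps = [b - a for a, b in zip(onsets, onsets[1:])]
            if all(min_gap_s <= g <= max_gap_s for g in gaps):
                return list(zip([round(t) for t in onsets], failures))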
Training
The training consisted of part- and whole-task practice for the first thirty minutes of the first
day. Subjects either received practice controlling or monitoring each component task, then
received practice with the whole system, first with performance feedback, then without.
After the practice session subjects were instructed to ask the experimenter if they had any
questions about the task, and all questions were answered. Subjects were also given ten
additional minutes of monitoring training at the beginning of the second day. All subjects
saw both auto-pilot types. Subjects were told that during the experiment they would be
responsible for detecting both types of system failures.

Considerable emphasis was put on how the system operated in terms of its “structure and
processes” (Kieras & Bovair, 1984). The mechanics of the system were explicitly explained
(e.g., “Pump P1 controls fuel flow from the left wing tank to the front fuselage tank.”), the
subtleties of system behavior were explained (e.g., “Pump P1 has twice the fuel pumping
capacity as pump P3.”), and the concept of the system was explicitly explained (e.g.,
“airplanes are sensitive to the location of weight, therefore making it important that fuel be
distributed properly throughout the aircraft.”).

This was done so that subjects developed a complete mental model of the system, as
emphasized in research comparing operators with and without mental, or “device,” models
(Kieras & Bovair, 1984). Although the training received by controllers and monitors was
different in the specific level of control, every effort was made to insure that all other
elements of the training (e.g., training time and level of explanation of the dynamic system)
were identical.
Results
Between- and within-subjects comparisons were made using signaled failure reaction time
(RT) and a combined RT and error rate measure for inferential failures. Analyses of
variance (ANOVA) were used to test for group differences and interactions for both
signaled and inferential failures. The combined performance measure for inferred RT was
used for the purpose of managing between-subjects variability common with complex
dynamic task performance (Parasuraman, 1986). Further, as discussed in the next section,
its use was based on precedent (Kessel & Wickens, 1982; Wickens & Kessel, 1979-1981).
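For concreteness, each between-group comparison reported below corresponds to a one-way
ANOVA on per-subject scores. A minimal sketch with invented data, using scipy (one
standard way to compute such a test; not the original analysis software):

    from scipy.stats import f_oneway

    # Hypothetical per-subject mean scores for one condition (not real data).
    controllers = [2.48, 2.55, 2.51, 2.60, 2.53]
    autopilot   = [2.70, 2.66, 2.72, 2.61, 2.68]

    f_stat, p_value = f_oneway(controllers, autopilot)
    print(f"F(1, {len(controllers) + len(autopilot) - 2}) = "
          f"{f_stat:.2f}, p = {p_value:.3f}")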
The use of an efficiency index is based on the assumption that, “subjects aggregate evidence
over time concerning the discrepancy between the sampled-system behavior and the
internal model of a non-failed system, until this evidence exceeded an internal decision
criterion. Detection efficiency is reflected in the rate of aggregation of internal evidence,
independent of the criterion setting” (Wickens & Kessel, 1980, p. 569). Therefore, because
efficient detection is both fast and accurate, it should be reflected in an index integrating
both measures.
Although indexes used in earlier research have been described as “somewhat arbitrary,”
(Wickens & Kessel, 1980, p. 569), every effort was made here to remove the arbitrary nature
of the weighting method, while still combining the measures and reducing overall RT
variability. Therefore, it was decided that a weighting scale would be used in which RT
was divided into either fast (<8000 ms) or slow (≥8000 ms), since eight seconds was very
close to the grand mean for the experiment. “Fast” RTs were scored as 1, RTs which were
“slow” were scored as 2, while misses were scored as 3. This created an ascending
combined RT/error rate scale in which optimal performance generated a 1 (every failure is
detected in less than 8 seconds), and bad performance received a 3 (every failure event was
missed). By using only three consecutive levels in the index, misses were appropriately
weighted as significantly worse than long “hits.” While the 16 second “hit” window is
somewhat arbitrary, it was considered acceptable because inferred failure detection in a real
task would be highly task and failure dependent. Further, detection performance in this
task is based on a continuum, so the actual window size is not particularly meaningful.
However, if this paradigm were based on a real operational task with real failures, this “hit”
window would take on considerable meaning. It was believed that this index successfully
reduced variability, yet was far less arbitrary than other weighting methods. The raw RT
data are presented in Appendix B.
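The scoring rule reduces to a few lines. In this sketch the cutoffs come from the text,
while the encoding of a miss as None is my own convention:

    FAST_CUTOFF_MS = 8000    # near the grand-mean RT for the experiment
    HIT_WINDOW_MS = 16000    # responses after this count as misses

    def score_inferential_failure(rt_ms):
        """Map one failure event onto the index: 1 = fast hit,
        2 = slow hit, 3 = miss."""
        if rt_ms is None or rt_ms > HIT_WINDOW_MS:
            return 3
        return 1 if rt_ms < FAST_CUTOFF_MS else 2

    def combined_index(rts):
        """Per-subject combined RT/error score: 1.0 is optimal, 3.0 worst."""
        scores = [score_inferential_failure(rt) for rt in rts]
        return sum(scores) / len(scores)

    print(combined_index([4200, 9100, None, 6500]))  # (1+2+3+1)/4 = 1.75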
Signaled Failures
Simple RT findings were generally contrary to expectation (see Figure 2). The yoked
monitoring group was marginally faster at detecting simple failures than was the controller
group (974 vs. 1172 ms) when compared directly [F(1,24) = 3.3, p < .1] in Session 1. The
optimized auto-pilot group was not significantly faster than the controllers (1120 vs. 1172
ms) nor significantly different from the yoked group (1120 vs. 974 ms).
In the transfer condition (Session 2), the yoked group was marginally faster than the
controllers (873 vs. 1103 ms), [F(1,24) = 2.01, p < .15]. Although this effect is weak, it is
reported because it is highly contrary to expectation. The auto-pilot group was also quicker
than the controllers (939 vs. 1031 ms), although this effect was not significant. The
difference between the yoked group and the optimized auto-pilot group (873 vs. 939 ms)
was also not significant.

Figure 2. Signaled failure RT (ms) for the Controller, Auto-pilot, and Yoked groups on
Day 1, Day 2 Auto-pilot, and Day 2 Yoked trials.
Inferential Failures
Figure 3 shows inferential failure detection results. Controllers were significantly better at
detecting inferential failures than the optimized auto-pilot group in Session 1, (2.535 vs.
2.676), [F(1,26) = 4.43, p < .05], but not significantly better than the Yoked group, (2.535
vs. 2.65). The optimized Auto-pilot group was not significantly different from the Yoked
In the transfer condition, in which all subjects performed both of the auto-pilot tasks,
controllers did not perform significantly better than either of the auto-pilot groups, although
all means were in the anticipated direction. The Controllers, when compared to the
optimized Auto-pilot group, were not significantly different (2.619 vs. 2.699), nor were the
Controllers different from the Yoked group when compared on the yoking task, (2.577 vs.
2.593), [F(1,25) = .58]. Interestingly, the Yoked group was better than the Auto-pilot
group at the Auto-Pilot task in the transfer condition (2.66 vs. 2.7), although this difference
was not significant.
Figure 3. Inferential failure detection performance (combined index) for the Controller,
Auto-pilot, and Yoked groups on Day 1, Day 2 Auto-pilot, and Day 2 Yoked trials.
Discussion
The results of Experiment 1 only partially supported the experimental hypotheses, yielding
both expected, and unexpected findings. The finding most consistent with previous
research was that subjects who controlled were significantly better at detecting inferential
failures than were the Auto-pilot monitors, and marginally better than Yoked monitors in
Session 1. This finding, although consistent with past research showing that controllers,
when compared directly with monitors, are better at detecting failures, was not entirely
predicted from the hypothesis given that the higher workload levels present when
controlling could have interfered with failure detection. In this particular task, the
proprioceptive feedback available to Controllers was only indirectly related to failures. In
the experiments using tracking tasks, however,
proprioceptive feedback was a direct indication of system failure and therefore a highly
salient cue. Proprioceptive feedback is therefore not considered a distinct advantage for
Controllers in this paradigm.

Subjects could “hypothesis test” in a failure condition as in past research using tracking
tasks, and this may have been a distinct advantage for Controllers. When Controllers
sensed illogical system behavior, they could test their hypothesis through pump or throttle
manipulation to see if their own inputs resulted in continued illogical system behavior.
Post-experiment interviews suggested that some Controllers did use this strategy when
detecting failures. In addition, although the Controllers’ overall failure detection
performance was marginally better than the Yoked group, there was a slight and
nonsignificant speed-accuracy trade-off in Session 1 (see Appendix B), which may have
been a result of Controllers taking the extra time to hypothesis test prior to signaling a
failure, causing their RT to be slightly greater and their accuracy significantly better. It is
also possible that Controllers took advantage of a more activated mental model of the
system, and were thus more sensitive to illogical system behavior. However, when
considered alone, this finding says little about the activity of the
operator’s mental model, given the other possible explanations for this advantage.

Reaction times for the signaled failures were generally consistent with the hypotheses.
Because signaled failure detection requires no inferential
activity in this experimental paradigm, it is likely that signaled failures are an effective
measure of workload. In fact, signaled failure RT was marginally faster for Yoked
monitors than Controllers, probably reflecting the lower workload levels for the Yoked
group. It is also possible that the greater latency for Controllers may have been the result of
subjects’ need to scan a greater portion of the display in order to perform the sub-task of
matching the throttle with the recommended level. Thus, it is possible that the shorter RTs
of the Yoked group were because subjects spent more time focused in the center of the
display where both failure types occurred, rather than switching their focal point to the
throttle display area on the periphery of the display. Although the Auto-pilot monitors also
had lower workload levels than the Controllers (and perhaps even lower than the Yoked
monitors), their signaled failure RTs were not significantly faster than the Controllers’.
This may reflect the generally weak performance observed in
all conditions associated with the auto-pilot monitoring task. Although not central to this
research, the weak auto-pilot performance will be discussed further in the following
paragraphs.
Results from Session 2 did not generally support the hypothesis that system controllers are
better monitors than subjects who monitored in Session 1. Although group means were in
the predicted direction, there were no significant differences between the Controllers and
the Auto-pilot or Yoked monitors. Controllers were slightly better than the Auto-pilot
group when transferring to the yoked condition, but this difference was marginally
significant at best (p < .13). Further, this result is more likely a result of very poor
performance by the Auto-pilot group in the Yoked condition, rather than good performance
by the Controllers. The only significant finding for Session 2 was that the Yoked group
performed better than the Auto-pilot group when transferring to the yoked condition. This
finding was not surprising given that the Yoked group had previous experience with the
yoked condition and the Auto-pilot group did not. However, one would expect the opposite
pattern in the optimized auto-pilot condition, and it did not emerge.
There are several possible explanations for the Controllers’ failure to perform significantly
better on inferred failure detection tasks than the two monitoring groups in Session 2.
While past tracking task research used two days for Session 1, not including training
(Kessel & Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995), I believed that the
cognitive nature of this paradigm would allow it to be learned more quickly than the subtle
motor skills required in a difficult tracking task. This assumption was incorrect, however,
as post-experiment interviews and experimental data suggested that the task was actually
quite difficult to learn and perfect, and that the training time had not been sufficient for
subjects to master the task. In fact, some subjects suggested that they were still learning the
task well into Session 2. Further complicating this picture is the fact that the high
workload in the controlling condition (in both training and during Session 1) may have
made it more difficult for Controllers to learn the task as compared to the two monitoring
groups. Given that the group means were in the predicted directions, it is possible that the
Controllers did have an advantage in detecting inferential failures but, because of the
learning issues, this difference was not strong enough to generate a significant effect.
The experimental hypotheses stated that there would be no effect of prior experience for
signaled failure reaction times in Session 2. This prediction was based on the theory that
any advantage during monitoring afforded to past controllers was due to a more activated
mental model, and thus would not affect signaled failure detection performance. However,
RTs for signaled failures were marginally affected by condition. The Yoked monitor group
was faster (at a marginally significant level) than the Controller group while performing the
yoked monitoring task. The Auto-pilot monitors were also faster than the Controllers,
although not significantly. Although the difference between Controllers and Yoked
monitors was not significant, it is unexpected and therefore quite interesting, and will be
explored further in the subsequent experiment. The most salient explanation for this
finding seems to be that Controllers scan the display more diligently than the Yoked
monitors, and therefore spend less time focused on the center of the display where the
signaled failures occurred.

The theory that system controllers scan more effectively is further supported by the
possibility that Controllers performed better on the inferential failure detection task. This
improved failure detection performance could have been the result of Controllers
integrating subtle cues from the system more effectively and therefore being more sensitive
to system abnormalities. Importantly, this integrating process would likely use information
from the throttle display in forming a diagnosis. It therefore seems that if Controllers are
more sensitive to the system operations as an integrated unit, they spend more time focusing
on the throttle display, accessing the important throttle information, and less time focused
on the center of the display.
Implications of this finding are that subjects who have controlled, and are likewise
benefiting from controlling experience while monitoring, seem to be spending more time
scanning the display for useful information. Given that the throttle provides subtle clues
about system behavior, the Controllers should have derived a failure detection advantage if
they were allocating more time to studying its impact on the system. However, the
Controllers were not significantly better on the inferential failure detection task than the
other groups. This implies that the information provided by the throttle was not valuable
enough to improve inferential failure detection performance for those who observed it. It is
possible that while throttle information may have been advantageously used by Controllers,
its benefit was not large enough to produce reliable group differences.
This explanation may provide some clues as to why accidents involving controlled flight
into terrain with auto-pilot engaged were not detected by the pilots even though there was
ample evidence of impending disaster. This finding also suggests that if one of the benefits
of controlling is more effective scanning, then scanning behavior
may be diagnostic of the effects of the “out of the loop” performance problem suggested by
Smolensky (1993). It is also possible that Controllers, because of their extensive experience
with controlling the throttle, gained a greater understanding of the relationship between the
throttle and fuel system behavior, and were hence more inclined to observe throttle activity
even when monitoring the system. This explanation supports the contention that active
controllers scan more while monitoring because they have developed a different failure
detection strategy, rather than because their mental model is more activated.
As mentioned previously, two different monitoring groups were used to address concerns
that the experimentally superior “yoked” auto-pilot method may induce differences in
monitoring behavior
compared to an “optimized” type auto-pilot. However, even the optimized automation is
completely task dependent. The optimized system for this task was based on the view that
aircraft automation is extremely consistent and rigid in the way it controls the various
systems. Therefore, the optimized system for this task consistently held fuel levels in the
“safe” zones, and operated the pumps in a rigid operational sequence to maintain correct
fuel levels. Additionally, the throttle setting was automatically maintained at the level of
the recommended setting.
Results from the experiment suggest that the hypothesized effects are not unique to the
“yoked” condition. In fact, in nearly all conditions the performance of the “optimized”
Auto-pilot group was worse than the Yoked group. This suggests that the effects found in
this paradigm are not due to the use of the yoked methodology. In fact, results obtained
using this method may underestimate effects found in applied settings due to the prevalence
of “optimized” automation in operational systems. Since this difference is not the focus of
this research, nor was it consistently significant, it will not be explored further. However, it
may be worth noting that the highly consistent behavior of the optimized system may have
had a numbing effect on subjects, thus pushing them even farther out of the control loop, or
so they perceived, and reducing mental model activation even further. In addition, it is also
possible that the consistency of the automation made them believe the task was easier,
compared to the Yoked monitors, who had to pay close attention to the automated system in
order to follow its less predictable behavior.
This view is supported by the fact that in Session 2 the Auto-pilot group had marginally
poorer performance on inferred failure detection tasks in the yoked condition, compared to
Controllers, and significantly worse performance than the Yoked monitors. Thus, when
transferring to the more difficult yoked monitoring task, the optimal Auto-pilot group was
at a further disadvantage as a result of their experience in the highly predictable optimized
auto-pilot condition. Further, there are indications that the Yoked group performed better
at the optimized auto-pilot monitoring task than the Auto-pilot group, even though this
condition was novel to them. This suggests that some feature of the yoked monitoring task
made Yoked monitors more sensitive to system behavior, thus giving them an advantage
even in novel monitoring conditions.
Experiment 1 Conclusions
The results of this experiment were informative and suggestive. Controllers, when
compared directly with monitors in Session 1, were better at detecting
inferential failures. This finding likely reflected “hypothesis testing” and perhaps improved
mental model activation. Signaled failure detection showed the
opposite results, reflecting the higher workload of controlling, and the necessity for
Controllers to observe the throttle for control purposes, resulting in less time spent focused
on the center of the display. Group
inferential failure detection means were in the predicted directions during Session 2,
although these differences were not significant. This finding suggests that a transfer effect
for Controllers may exist, but the experiment as conducted lacked power. The results of
signaled failure performance in Session 2 were surprising, perhaps reflecting the fact that
subjects with experience controlling scan the display more effectively, supporting previous
contentions that mental models play a role in guiding perceptual activity (Endsley, 1995).
However, this difference could also be due to Controllers developing a failure detection
strategy more dependent on the effect of throttle behavior. In either case, Controllers seem
to spend more time focused on the throttle while monitoring, and less time focused on the
center of the display.
The primary goal of Experiment 1 was to replicate earlier findings that controllers of a
dynamic system are better at detecting system failures than subjects who only monitor when
both are transferred to a monitoring task. A related goal
was to replicate these findings using a cognitively complex dynamic system management
paradigm. Although the experiment yielded interesting results, the primary objective was
not fully met.
Experiment 2 was designed to both correct the weaknesses of Experiment 1, and to further
explore the surprising findings from the signaled failure detection task. In addition, the
“optimized” auto-pilot condition was dropped from Experiment 2, since the predicted effect
seems to be present using both auto-pilot types and the yoking methodology is
experimentally superior.
The first change was the addition of an
additional day for Session 1. The extra day was added to address the anecdotal and
experimental evidence suggesting that subjects were still learning the task well into the
second day (Session 2). I am interested in the transfer effects from a well learned task, and
it is therefore imperative that the task be well learned before subjects switch to the transfer
task. In addition, some subjects suggested in post-experiment interviews that they were
confused by the triggering system and the consequent lack of performance feedback, and
that this confusion further hampered their ability to quickly learn the task. Subjects
indicated that because of the subtle nature of the inferential failures, even though the failure
behavior ceased after detection, it wasn’t always clear if a failure had been successfully
signaled. This confusion was exacerbated by the fact that a “false alarm” deactivated the
trigger, so that when subjects positively identified a failure in the same trial, trigger
activation had no effect, leading them to believe that they had improperly diagnosed the
failure.
To address this confusion, a message system was added to the display to inform subjects of
both the state of the trigger (armed or not) and whether or not they had correctly identified
a failure. Not only did this procedural change augment the performance information that
subjects generally assumed on their own but, more importantly, prevented any false
learning resulting from system state misinterpretation. Although this feedback could be
criticized on the grounds that better performers would receive more positive feedback, this
method afforded users the opportunity to learn from both correct and incorrect performance.
Although there was implicit feedback in Experiment 1, it favored individuals with better
performance to an even greater degree, since good failure detection performance likely
meant better system understanding. Therefore, improved system understanding not only
led to better performance, but also more accurate interpretation of implicit system feedback.
The intentional system subtleties included in Experiment 1 were carried over into
Experiment 2, but were exaggerated somewhat to further occlude the inferential failures.
Failure onset was made more subtle, and pump flow-rate differences were exaggerated
slightly. Most importantly, the non-linear relationship between the throttle level and the
rate of fuel consumption was exaggerated, and the recommended throttle level changed
positions at a greater frequency, making throttle level monitoring (and controlling) more
demanding. This procedure was used because in Experiment 1 Controllers may have been
spending more time scanning the throttle display while monitoring. If knowledge of
throttle activity is now more important for inferential failure diagnoses, then any scanning differences between the groups should be reflected in both inferential and signaled failure detection performance.
To test the hypothesis that Controllers may have poorer signaled failure detection
performance because they spend more time scanning the display for throttle information,
the throttle information was both moved farther to the edge of the display and made slightly
less salient. Both changes were made to increase the time required to effectively scan the
throttle information. This change should exaggerate the signaled failure detection deficit expected for Controllers, making it easier to observe this difference.
To further explore this issue, the throttle was removed from the display on half of the trials in Session 2. If the Controllers’ signaled failure detection deficit is a result of differences in scanning behavior, then their signaled failure detection performance should improve when the throttle is absent. Conversely, if Controllers are using the throttle information to facilitate inferred failure detection, then the removal of this information should reduce their inferential failure detection advantage.
Experimental Hypotheses, Experiment 2
The hypotheses for Experiment 2 both restate those of Experiment 1 and further explore unexpected findings from Experiment 1. I expect the direct comparison
between Controllers and Monitors in Session 1 to again show a small advantage for
Controllers in the inferential monitoring task, as a result of hypothesis testing and perhaps
improved mental model activation, but a disadvantage in the signaled failure detection task
due to higher workload and the need to spend more time focused on the throttle display.
Controllers should also show an advantage over Monitors in the inferential failure detection
task in Session 2, supporting the hypothesis that the heightened activation of the
controllers’ mental models makes them more sensitive to inferential failures when monitoring.
However, this advantage in inferential failure detection may only be present in the “throttle
visible” condition (see Method). If the activated mental model guides perception (Endsley,
1995), and attention is thus directed to the throttle information on the display because it
provides relevant data for inferring abnormal operation, the absence of throttle information
should impair the failure detection advantage of Controllers. Further, if the poor signaled failure detection performance of Controllers is a result of their scanning behavior, then the “throttle not visible” condition in Experiment 2 will show no such Controller disadvantage.
EXPERIMENT 2
Method
The Methods section for Experiment 2 highlights only differences from Experiment 1.
Subjects
Students enrolled in Psychology courses were used in the experiment. Students received “experimental credit”
and were paid a base rate for their participation in the experiment. Additionally, subjects
were given the opportunity to earn a five dollar bonus for good performance.
Task
The task for Experiment 2 was the same as that used for Experiment 1 except for the
following changes: A trigger and performance feedback message was added to the lower
right corner of the display to address the confusion about system state expressed by subjects
in Experiment 1. “Trigger Armed,” “False Alarm, trigger INOP until reset,” and “Failure detected” messages kept subjects apprised of the system state. In addition, the messages were color coded to heighten awareness of state changes.
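To make the trigger logic concrete, the following minimal sketch captures the message behavior described above. The class and method names are hypothetical; the actual experiment software is not reproduced in this text.

    class FailureTrigger:
        """Sketch of the Experiment 2 trigger/feedback messages (hypothetical names)."""

        def __init__(self):
            self.armed = True

        def fire(self, failure_active):
            # A false alarm deactivates the trigger until the next reset, so
            # subsequent activations cannot signal a failure.
            if not self.armed:
                return "False Alarm, trigger INOP until reset"
            if failure_active:
                return "Failure detected"
            self.armed = False
            return "False Alarm, trigger INOP until reset"

        def reset(self):
            # Re-arm the trigger (e.g., at the start of a new trial).
            self.armed = True

        def status(self):
            return "Trigger Armed" if self.armed else "False Alarm, trigger INOP until reset"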
Two changes were made to the throttle portion of the display. First, the throttle was moved
farther toward the upper right-hand corner of the display and the “recommended throttle
position” was made less salient by decreasing the width of the indicator bar. Both of these
changes were made to increase the time required to scan the throttle-setting portion of the
display. In a related change, the digital aircraft speed was moved from the forward-center
location of the aircraft to the upper left-hand corner of the display. This was done to further
increase the time needed to effectively scan all information components of the display. The
second major change was the removal of all throttle information on half of Session 2 trials.
This change eliminated the need for subjects to scan the periphery of the display, but also
removed information which may have helped them in the inferential failure detection
process.
In order to further occlude normal system operation and thus complicate the inferential
failure detection process, individual pump flow rate differences were exaggerated, the
linearity of the throttle level/fuel flow ratio was degraded, and inferential failures
themselves were made slightly harder to detect. The final change to the task for
Experiment 2 was that the time given to detect a pump failure was reduced from 5 seconds
to 3.5 seconds because the results of Experiment 1 suggested that the extra 1.5 seconds was
unnecessary.
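As a purely illustrative sketch of the kind of non-linearity described above (the actual throttle-to-fuel-flow mapping used by the task software is not specified in this text, and k and gamma below are hypothetical parameters):

    def fuel_flow(throttle, k=1.0, gamma=1.6):
        """Hypothetical non-linear throttle-to-fuel-flow mapping; k and gamma
        are illustrative, not values from the task software."""
        return k * throttle ** gamma

Exaggerating the non-linearity (e.g., increasing gamma) makes fuel consumption harder to predict from the throttle setting alone, which is precisely what complicates inferential failure detection.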
Experimental Design
Two groups participated in the transfer of training, between-subjects design. The first
group controlled the system during the first and second days of the experiment (Session 1)
while the second group monitored a “yoked” auto-pilot during Session 1. On the third day
(Session 2), the transfer condition, both groups monitored the yoked condition and detected
failures. However, in two of the four trials, the throttle information was eliminated from the display, as summarized below.

Participatory Mode, Experiment 2

Group               Session 1 (Days 1, 2)   Session 2 (Day 3, Monitoring)
Controllers         Control                 Throttle Visible; Throttle NotVisible
"Yoked" monitors    Monitor                 Throttle Visible; Throttle NotVisible
Training
The training session was 30 minutes at the beginning of Day 1, and was identical to that of Experiment 1.
Results
Between- and within-subject comparisons were made for both signaled failure RT and the combined RT and inferential failure error rate measure used in Experiment 1. An analysis of variance (ANOVA) was used to test for group differences and interactions.
Session 1 data were from Day 2 only unless otherwise specified, as Day 1 was treated as
learning. The raw RT and error rate data for Inferred failures are provided in Appendix C.
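As an illustration of the between-groups comparisons reported below, this minimal sketch shows how a two-group F test of mean signaled failure RT could be computed. The data and tooling here are hypothetical and are not those used in this research.

    from scipy import stats

    # Hypothetical per-subject mean signaled-failure RTs (ms); not the
    # experiment's data.
    controller_rt = [835.0, 812.0, 860.0, 798.0, 841.0]
    monitor_rt = [794.0, 810.0, 775.0, 802.0, 788.0]

    # With two groups, a one-way ANOVA is equivalent to the F(1, N - 2)
    # group comparisons reported in this section.
    f_stat, p_value = stats.f_oneway(controller_rt, monitor_rt)
    print("F(1,%d) = %.2f, p = %.3f"
          % (len(controller_rt) + len(monitor_rt) - 2, f_stat, p_value))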
Signaled Failures, Session 1
Consistent with the results of Experiment 1, subjects were still learning the task into the second day, as
demonstrated by the significant main effect of Day for mean reaction time from Day 1 to Day 2,
[F(1,36) = 13.6, p < .01]. Although the Group by Day interaction was not significant,
Controllers’ improvement was larger from Day 1 to Day 2 in Session 1 (960 vs. 835),
[F(1,17) = 14.45, p < .01], than Monitors’ (867 vs. 794), [F(1,19) = 3.09, p < .1]. A simple
comparison between Controllers and Monitors in Session 1 (Day 2) was in the predicted
direction but was not significant (835 vs. 794), [F(1,37) = .27].
Signaled Failures, Session 2
Session 2 yielded surprising findings for signaled failures. There was a main effect
favoring Controllers over Monitors, [F(1,36) = 4.75, p < .05], and, as shown in Figure 5, a Group by Visibility interaction that approached significance [p < .15]. There were no significant group differences in the throttle Visible condition (667 vs.
728), [F(1,37) = 1.21], but there was a significant difference in the throttle NotVisible
condition (604 vs. 757), [F(1,37) = 6.62 , p < .05]. As expected, Controllers improved
from the throttle Visible to the throttle NotVisible condition (667 vs. 604), [F(1,17) = 6.64,
p < .05], while the Monitors’ mean RT increased, but not significantly (728 vs. 757). (See
Figure 5.)
Figure 5. Signaled failure RT for Controllers and Monitors across Session 1 (Days 1 and 2) and Session 2 (throttle Visible and NotVisible conditions).
Inferred Failures
There was a significant effect for Day in Session 1 favoring Day 2 [F(1,36) = 15.2, p <
.01] with no significant Day by Group interaction, supporting the contention that both
groups were still learning the task after the first day. This is also supported by the false
alarm data which showed a significant reduction by day, [F(1,36) = 20.4, p < .01], and no
interaction.
Controllers had a lower mean (better performance) than did Monitors in Session 1 (Day 2),
but it was not significant [F(1,37) = .18]. As in Experiment 1, there was a slight non-
significant speed/accuracy trade-off in this condition (see Appendix B), favoring better accuracy for Controllers. In Session 2, Controllers detected inferred failures significantly better than monitors, but only in the throttle Visible condition (1.9 vs. 2.15), [F(1,37) = 4.19, p < .05]. The mean performance score for Controllers was better than the Monitors, but not
significantly (2.01 vs. 2.04), [F(1,37) = .1]. Although there was no group effect favoring
Controllers over Monitors by condition, there was a marginally significant interaction (see
Figure 6), [F(1,36) = 3.67, p < .1], resulting from Controllers having poorer performance
in the throttle NotVisible condition compared to the throttle Visible condition (1.9 vs.
2.01), [F(1,17) = 2.35, p < .15], while Monitors performed better in the throttle NotVisible
condition, although this difference was not significant (2.15 vs. 2.04), [F(1,19) = 1.55].
Figure 6. Inferred failure index for Controllers and Monitors across Session 1 (Days 1 and 2) and Session 2 (throttle Visible and NotVisible conditions).
Discussion
The results of Experiment 2 strongly support the experimental hypotheses, with few exceptions. The significant improvements in performance from Day 1 to Day 2 in Session 1 support the belief that Experiment 1 subjects either had not learned the task or were not proficient at the task by the end of Day 1. Further, these data are especially noteworthy given the additional feedback provided to subjects in Experiment 2, which likely facilitated
task acquisition.
In Experiment 1, Controllers were marginally slower than Monitors at detecting signaled failures in Session 1; these differences were not significant in Experiment 2, although means in both experiments were in
the same direction. This may be a reflection of the fact that by Day 2 workload levels were
probably more similar between the two groups than in Experiment 1, as the additional
practice afforded by Day 1 may have reduced the workload levels for Controllers on Day 2.
This contention is based on the premise that workload in Session 1 in Experiment 1 was a
result of both having to learn to detect failures and learn how to control the system, in
addition to manually controlling the system, the latter two tasks not being applicable to
system monitors. However, in Experiment 2, much of the learning had already taken place,
leaving workload differences between the two groups a result only of the need to manually control the system.
Session 2 signaled failure detection performance supported the hypothesis that Controllers
scan the display more effectively than do the Monitors. Two features of the signaled failure
detection performance support this contention. First, and most importantly, is the fact that
there is a significant difference for Controllers between the throttle Visible and throttle
NotVisible condition, yet there is no such difference for Monitors. This is supported by
both the within-subjects’ comparisons and the marginally significant group interaction of
throttle visibility. This finding suggests that the Controllers scanned the peripherally-
located throttle information to facilitate inferential failure detection when the throttle was
present on the display. This scanning of the throttle information necessarily meant a cost in
RT for detecting signaled failures. Thus, in the throttle NotVisible condition, the
Controllers did not have the option of scanning the peripherally located throttle
information, and their signaled failure RT decreased significantly because attention was
focused only in the center of the display. Likewise, the Monitors’ signaled failure detection performance was unaffected by the removal of the throttle information, suggesting that there was little, if any, attention allocated to it when it was present.
Surprisingly, there was a significant group effect for signaled failure RT, and a marginally
significant interaction. This finding is contrary to the marginal effect found in Experiment
1 in which Controllers were slower than Yoked monitors. However, in the equivalent
throttle Visible condition in Experiment 2, there was no significant difference. This leads
to the speculation that the Experiment 1 finding was a statistical artifact. However, in the
throttle NotVisible condition, the Controllers were significantly faster than the Monitors,
leading to the significant group effect. This finding is contrary to expectations, as the
hypothesis was that these two groups should have performed similarly on the signaled
failure detection task, as both groups were focused similarly in the center of the display.
Although this finding is problematic for the hypothesis that the controller advantage while
monitoring is due to a higher activation state of the subject’s mental model, there are two
likely explanations which are consistent with the theory. The first is that Controllers
benefit from a more activated mental model of the system and that this activation not only
enhances their ability to perceive, integrate and analyze features of the task with greater
efficiency, but spills over such that even simple stimuli are perceived and responded to
more efficiently. The second possible, but less likely, explanation is that Controllers were
frustrated by the lack of throttle information in the throttle NotVisible condition and were
thus channeling extra effort into the task. While this extra effort did little to enhance
inferential failure detection, it did result in significantly better signaled failure detection
performance.
The pattern of outcomes for inferential failure detection conformed to the experimental
hypotheses. As with signaled failure detection performance, there was a significant effect
of Day in Session 1, with no group interaction. This reflected the fact that both Controllers
and Monitors were still learning the task into the second day. In Day 2 of Session 1, mean
performance for Controllers was better than the Monitors, but this difference was not
significant. This finding reflects a consistent trend in this paradigm: when Controllers and Monitors are compared directly, there is a slight speed/accuracy trade-off, with accuracy in favor of Controllers. This is likely a result of Controllers taking the time to
manipulate the system in order to “hypothesis test.” While hypothesis testing generated
more accurate performance, there was some cost in RT. However, none of these differences
(reaction time or accuracy) were significant when compared directly. The fact that both
Experiment 1 and Experiment 2 generated the same trade-off in Session 1 implies that this
is a true effect. Further, Wickens and Kessel (1979) found the same trade-off when comparing controllers and monitors directly. Controllers had slightly higher workload than Monitors in Session 1, which seems not to
have an effect on the Controllers’ inferential failure detection performance. This finding
further supports Wickens and Kessel (1980) who found that workload resulting from
manual response organization and execution (e.g., manual tracking) may not compete with the perceptual and cognitive resources required for failure detection.

While results from tracking-task experiments suggest that the proprioceptive feedback from
tracking improves performance for Controllers, such direct feedback about system behavior,
specifically system failures, was not available proprioceptively to Controllers in the fuel
management paradigm. However, response related information might result from the act of
controlling the throttle and manipulating the fuel pumps, thus instantiating the “state” of
the system for Controllers. While this response information is certainly not as diagnostic
about system state as the proprioceptive feedback from tracking, it may serve a similar role
in updating the operator’s mental model of system activity (i.e., the dynamic execution of
the operator’s mental model), thus off-setting any performance deficits due to higher
workload. There are thus two non-competing explanations for the lack of effect of higher
workload for controllers. Either the resources required to control the system are different from those required to detect subtle inferential failures, or the information obtained or
reinforced from the act of controlling made the task of detecting failures easier, and
therefore more resource efficient even though the resources were the same.
Results from Session 2 supported the experimental hypotheses and successfully replicated
previous findings of Controller superiority in the monitoring task. In the throttle Visible
condition, the Controllers had significantly better inferential failure detection performance
than Monitors. While this finding supports the hypothesis that controlling a system causes
one to be a more effective monitor of inferential failures, it is made more diagnostic by the
fact that no such advantage for Controllers exists in the throttle NotVisible condition.
There was no significant difference between Controllers and Monitors in the throttle
NotVisible condition, and the ability of Controllers declined slightly from the Visible to the
NotVisible condition. There was also a marginal Group by Visibility interaction, reflecting the differential effect of throttle visibility on the two groups.
While the intent of the throttle visibility manipulation was to affect signaled failure
detection performance, which it did, it was unknown whether the removal of throttle
information would actually hinder performance. In theory, the throttle display provides
information which is useful, but not critical, in diagnosing inferential failures. However,
data from Experiment 1 seemed to indicate that while Controllers were focusing more on
throttle information than Monitors, it did little to help them in detecting inferential failures.
However, in Experiment 2 the throttle mechanism was altered to make it a more valuable
information component in the detection of inferential failures. It appears that this change,
in combination with the increased proficiency gained from the additional day in Session 1,
caused individuals who focused more on the throttle information to have a distinct advantage in detecting inferential failures.
Importantly, the effect of throttle visibility suggests that scanning the throttle information
was the critical behavior that enhanced Controller performance. This finding is easily
interpreted through Endsley’s (1995) view of the role of the well developed, or highly
activated, mental model of the behavior of a particular system. Endsley (1995, p.43)
suggests that this model, “provides (a) knowledge of the relevant elements of the system
that can be used in directing attention and classifying information in the perception process,
(b) a means of integrating the elements to form an understanding of their meaning, and (c)
a mechanism for projecting future states of the system based on its current state and an
understanding of its dynamics.” Viewed in the context of the current dynamic execution
theory of mental models, these data can be interpreted to suggest that the Controllers,
because of their activated mental model, direct attention to the throttle mechanism, given its
diagnostic importance in detecting failures, and then successfully integrate that perceptual
information with other momentary system attributes to successfully detect failures. When
the throttle information is not visible, this perceptual and computational advantage goes
unused, as is indicated by the non-significant performance difference between Controllers
and Monitors in the throttle NotVisible condition. Although Controllers may have had
some advantage, as seen in the mean difference favoring Controllers, this advantage was not statistically reliable.
Experiment 2 successfully supported the hypothesis and replicated findings that controllers
are better at detecting failures when transferring to a monitoring task than subjects who
monitor in both conditions. Further, the hypothesis that controllers may scan the display
more in an attempt to perceive task-relevant stimuli was also supported by the fact that Controllers were slower to detect centrally located failures when relevant system information was present in the periphery of the
display. In addition, it appears that the Controllers not only scanned the display for
information, but they perceived and integrated it more efficiently than monitors and were
thus more effective at detecting inferential failures. The only surprise was that Controllers
were, on the whole, better at detecting signaled failures than were system monitors, suggesting that there may be some carry-over effect from an activated mental model which is only beginning to be understood.
Several practical and theoretical implications can be drawn from these findings. Most
importantly, the transfer advantage of controllers over monitors was replicated using a more
realistic, cognitively complex dynamic task. The similarity of this paradigm to other
dynamic systems, and the convergence of these data with past findings supports the
contention that experience controlling a system (being “in the loop”) provides advantages to
operators when they must passively monitor the system. These findings also suggest that
controlling the system may make monitors more sensitive to system variability, and
especially to those features of the system which were controlled in the past. This strongly
supports concerns by Moray (1986) that there may be serious consequences when operators
learn to monitor a system without ever having controlled the system. Perhaps, in such
learning environments, the relationships between system variables are simply not
understood or appreciated to the same degree as when one must manually control system
variables. This is especially significant, given the suggestion that pilots transitioning into
highly automated aircraft have little opportunity to acquire or practice manual flying skills.

Signaled failure detection performance was expected to be unaffected by the experimental manipulations except for the Controllers in the throttle Visible condition. The data,
however, showed that Controllers across the Visibility condition were significantly faster
than monitors, with the most significant difference being in the throttle NotVisible
condition. While this can be interpreted in a manner which does not contradict the
hypothesis, it may be viewed as somewhat problematic for a hypothesis that states that the controller advantage derives from the heightened activation of the operator’s mental model. This would imply that a well-activated mental model not only guides perception to
critical features of that system, but it may also affect perceptual sensitivity to features of the display unrelated to system operation.
This experimental design does not preclude the possibility that controllers and monitors
develop slightly different mental models of the dynamic system, despite every effort made
in training to prevent it. While the controller’s mental model obviously contained an actual
motor-control component, it was believed that both groups would likely develop the same
underlying operational understanding of the system, and thus the same mental model for
use in inferential failure detection. It is possible, however, that the act of controlling in
Session 1, either through a more active learning experience, or by the reinforcing of certain
system-variable relationships resulting from controlling those variables, may have caused
the development of slightly different mental models. While this does not exclude a mental
model activation theory, it does suggest that Controllers may have a more activated, but also somewhat different, mental model.
Although I believe that this experimental design is highly valid for operational
environments in which training departments have the choice of training future system monitors with or without controlling experience, it lacks ecological validity in the current aviation context. All pilots of highly automated aircraft learn to fly manually before transitioning to automated commercial operations, as suggested by Orlady and Wheeler (1989). Young (1969) and
Wickens and Kessel (1979) used a repeated measures design so that all subjects both
controlled and monitored. While this design generated failure detection performance
differences between system controllers and monitors, it was impossible to determine the
degree to which a more consistent internal model of the system contributed to the observed differences.
Given the success of the current experiment in replicating Kessel and Wickens (1982), I
feel that a return to a repeated measures design using this cognitive dynamic task would
offer several distinct advantages for answering additional questions generated by this
experiment. First, a repeated measures design controls for the large between-subjects
variability found both in this experiment and typically in complex vigilance tasks
(Parasuraman, 1986). More importantly, however, it insures that all subjects develop the
same mental model of the system. While this feature was problematic for Wickens and
Kessel (1979), the fact that proprioceptive feedback is not a direct indication of system
failure in the current paradigm makes this a less pervasive problem. Further, a repeated
measures design has more ecological validity, helping to answer the question of whether periodic controlling can benefit monitoring performance in operational settings.
Experiment 3 uses the same dynamic fuel management task as in Experiment 2 but with a
repeated measures design. This design was altered so that all subjects were trained in the
controlling task and given sufficient time to become proficient at the task (two days in
addition to the training session, as in Experiment 2). Subjects then monitored and detected
failures for the next four days except for two trials on either Day 5 or Day 6, in which they again controlled. The subjects’ failure-detection performance for both failure types was then compared for the two trials following the controlling re-introduction to the same two trials after continued monitoring.
As in the previous experiments, it is hypothesized that controlling the system would cause improved monitoring performance through the re-activation of the operator’s mental model. Given the superior ecological validity of this design for aviation operations, the hypothesized improvement of performance has strong implications for the value of periodic controlling in operational settings.
EXPERIMENT 3
Method
The Methods section for Experiment 3 highlights only differences from Experiment 2.
Subjects
Fifteen right-handed male university students were used in the experiment. Students were
paid a base hourly rate for their participation in the experiment. Additionally, subjects were
given the opportunity to earn a higher hourly rate for good performance.
Task
The task for Experiment 3 was the same as that used for Experiment 2 except for the
following change:
A message box was added to the lower left corner of the display informing subjects of the
participatory mode. The message stated either “Automatic control,” or “Manual control,”
and the messages were displayed in different colors to help alert subjects to any change in participatory mode. In the previous experiments, subjects were informed of the participatory mode in the training session prior to that day’s task, so no message system was necessary.
Experimental Design
A completely within-subjects transfer of training design was used to address the large between-subjects variability typically found in complex vigilance tasks (Parasuraman, 1986) and in the previous two experiments. All subjects learned the controlling task while
detecting both failure types, and proceeded to participate in the controlling mode for the
first two days. Subjects then spent the remaining four days in the monitoring mode, except
for the two 12-minute trials in which they were reintroduced to controlling.
Because of the potential confounding effects of trial and day in a within-subjects transfer of
training design, a pilot study for Experiment 3 was conducted to determine the best
sequence of conditions. The pilot study used four subjects who controlled the system and
detected failures for the first two days, then transferred to the monitoring mode on Day
3. Subjects monitored the system and detected both failure types on Days 3 through 9.
Results of the Experiment 3 pilot study for trial effects showed a significant difference
between Trials 1 and 4 for inferred failures [F(1,4) = 9.9, p < .05], and a non-significant
difference in the same direction for signaled failures. There was a marginally significant
difference between Trials 3 and 4 for inferred failures [F(1,4) = 6.6, p < .1], but no
difference in means for signaled failures. There was no Trial by Day interaction, indicating that the observed trial effects were stable across days. Importantly, failure detection
performance was stable in Trials 4 and 5 for both inferred and signaled failures. (See
Figure 7.)
Figure 7. Experiment 3 pilot study: signaled failure RT and inferred failure index across Trials 1 through 5.
The pilot study results for Days revealed the typical trend of an improvement from Day 1 to
Day 2 (both controlling days) for both signaled and inferred failures as seen in Experiment
2. Further, on Day 3 (the first monitoring day), inferred failure detection performance
declined, while signaled performance increased. More importantly, however, both inferred
and signaled failure detection performance are stable by Day 4 and remained that way through Day 7 (see Figure 8).
Figure 8. Experiment 3 pilot study: signaled failure RT and inferred failure index across Days 1 through 9.
Surprisingly, there was an improvement in inferred failure detection performance on Day 8,
without a concurrent improvement for signaled failures. While this initially appears
contrary to the hypothesis that monitors’ performances should decline after continued monitoring, it is likely an artifact of this experimental paradigm. In a true operational environment, operators would seldom see the same failure often enough for repeated exposure to form the basis for system failure diagnosis. But in this experiment, it appears that sensitivity to
one type of inferred system failure may become a factor after long duration interaction with
this system. By the beginning of the eighth day, system monitors had already observed 210
inferential failures, not including the training session on the first day. It is therefore quite
likely that the tremendous exposure to the inferential failures used in this paradigm actually increased subjects’ sensitivity to this type of failure.
Another explanation is that this improvement is due to subjects anticipating the end of the
experiment. However, this explanation is discounted because the increase occurred on Day
8, not Day 9, and there was actually a marginal decrease in performance from Day 8 to Day
9. Further, there was no concurrent performance increase for signaled failure detection on
Day 8.
Experiment 3 Design
Results from the pilot study revealed three significant design considerations for Experiment
3. First, given the stability of Trials 4 and 5 for both signaled and inferred failures, it was
determined that these trials would be the best for the between- and within-day comparisons. This left the first three trials available for the requisite controller re-introduction. Second,
due to the stability in both signaled and inferred failure detection performance on Days 4
through 7, a six day experiment was chosen. This design allowed both sufficient
controlling experience by using Days 1 and 2 as controlling days, and also allowed a stable monitoring baseline before the controller re-introduction. Therefore, subjects controlled on either Day 5 or Day 6 (to counter-balance any potential
effect of Days) on Trials 2 and 3, and then monitored on Trials 4 and 5 (on either Day 5 or
Day 6 depending on the counter-balance). Comparisons were made between Trials 4 and 5
after controller re-introduction to Trials 4 and 5 after continuous monitoring. (See Figure
9.)
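The counterbalanced schedule can be summarized in a short sketch; the labels and function name below are hypothetical, not part of the experiment software.

    def day_schedule(control_day, day):
        """Five 12-minute trials per day. On the counterbalanced re-introduction
        day, Trials 2 and 3 are controlling; all other trials are monitoring."""
        if day == control_day:
            return ["monitor", "control", "control", "monitor", "monitor"]
        return ["monitor"] * 5

    # Half of the subjects controlled on Day 5, the other half on Day 6;
    # Trials 4 and 5 of each day supply the post-control vs. post-monitor
    # comparison.
    for control_day in (5, 6):
        print(control_day, day_schedule(control_day, 5), day_schedule(control_day, 6))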
Failure Rates
Given that the results of the pilot study suggested that continued monitoring performance
might result in an increased sensitivity to the inferential failures in this paradigm (as seen
on Days 8 and 9), Experiment 3 was designed to minimize subjects’ exposure to inferred
failures. In addition to experimental issues, this consideration was warranted by the fact
that lower inferred failure exposure increased external validity. Therefore, although the
Day 1 failure rate remained the same as in Experiment 2 (six signaled/six inferred), the
number of inferred failures experienced by subjects was decreased in the remainder of the
experiment. On Days 2 through 4, subjects were exposed to four inferred failures on one
trial, two inferred failures on two trials, and zero on two trials. Signaled failures remained
constant for all trials to insure that subjects remained focused on the task even when no inferred failures were present. On Days 5 and 6, subjects were exposed to two inferred failures on Trial 1, zero inferred failures on Trials 2 and 3 (the controller re-introduction trials), and six inferred failures on the comparison trials (Trials 4 and 5).
No inferred failures were presented during the re-introduction trials because operational environments would likely not expose operators to specific failures. In addition, having
subjects control the system without failures is a stronger test of the hypothesis that
controlling a system activates their dynamic model of the system, thus making them more sensitive to subsequent failures. Finally, presenting Trials 2 and 3 without inferential failures was consistent with their expectation bias for inferential failures
developed over the previous three days. This avoided any implicit suggestion that Days 5 and 6 were different from the previous days, with the exception of the controlling re-
introduction.
Figure 9. Failure occurrences, Experiment 3 (signaled and inferred failures per trial; each column is a day, each row a trial within that day):

Day 1          Day 2          Day 3          Day 4          Day 5          Day 6
6 sig, 6 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 0 inf
6 sig, 6 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 0 inf   6 sig, 0 inf
6 sig, 6 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 6 inf   6 sig, 6 inf
6 sig, 6 inf   6 sig, 4 inf   6 sig, 4 inf   6 sig, 4 inf   6 sig, 6 inf   6 sig, 6 inf
Training
The training session was 30 minutes at the beginning of Day 1, and was identical to
Experiment 2. A brief message appeared at the beginning of Day 2 informing subjects that they should continue to control the system and detect both failure types.
At the beginning of Day 3, a message appeared explaining that subjects should monitor the
system and detect both failure types. In addition, it was explained that there would be some
trials in which subjects were required to control the system and detect failures. Subjects
were therefore instructed to check the display at the beginning of each trial to see if the
system was in “Automatic” or “Manual” control mode. In addition, subjects were informed
that they would again see different numbers of failures on each trial for the remainder of the
experiment.
Results
Within-subject comparisons for Trial (4 and 5) within Condition (Post-Control [PC] and
Post-Monitor [PM]), and Trial by Condition were made for both signaled failure RT and the
combined RT and inferential failure error rate measure used in Experiments 1 and 2. An
analysis of variance (ANOVA) was used to test for Trial and Condition main effects and
interactions. Only Trials 4 and 5 on Conditions PC and PM were analyzed for differences.
Although the post-control and post-monitor conditions occurred on both Day 5 and Day 6 (counter-balanced across subjects), they are referred to simply as condition Post-Control [PC] and condition Post-Monitor [PM] for purposes of clarity.
raw RT and error rate data for Inferred failures are provided in Appendix D.
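The exact combination rule for the inferential failure measure is given in the Experiment 1 Method and is not reproduced here. Purely for illustration, a standardized RT/error composite of the general kind described could be formed as follows; both the data and the formula below are hypothetical.

    import statistics

    def zscores(xs):
        mean, sd = statistics.mean(xs), statistics.pstdev(xs)
        return [(x - mean) / sd for x in xs]

    # Hypothetical per-subject values; not the experiment's data.
    mean_rt = [8.1, 9.0, 8.6, 7.9]         # seconds to signal an inferred failure
    error_rate = [0.20, 0.35, 0.25, 0.30]  # proportion of inferred failures missed

    # One possible composite (lower is better); the dissertation's actual
    # formula may differ.
    index = [z_rt + z_err for z_rt, z_err in zip(zscores(mean_rt), zscores(error_rate))]
    print(index)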
Signaled Failures
Controller re-introduction appeared to slow signaled failure detection performance (see Figure 12). There was a marginally significant main effect of Condition (failure detection post-controlling [PC] versus post-monitoring [PM]; 846 vs. 719), [F(1,14) = 3.39, p < .1], but no main effect for Trial (Trial 4 vs. Trial 5; 789 vs. 777). Further,
there was no significant Trial by Condition (PC vs. PM) interaction. Planned comparisons
for Trials between and within Conditions (PC vs. PM) for signaled RT yielded a single significant difference. There were no significant differences within Condition PC for Trial (4 vs. 5;
827 vs. 865), nor for Condition PM (750 vs. 688). In addition, there was no significant
difference between Conditions for Trial 4 (827 vs. 750). However, there was a significant
difference between Conditions for Trial 5 (865 vs. 688), [F(1,14) = 4.9, p < .05] as shown
in Figure 11.
Figure 11. Signaled failure detection performance, Trials 4 and 5, Conditions Post-Control and Post-Monitor.
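As a minimal illustration of the within-subjects Trial by Condition analysis described above, the following sketch runs a two-way repeated-measures ANOVA. The data and the tooling are hypothetical; they are not those used in this research.

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Hypothetical long-format data: one signaled-failure RT (ms) per subject,
    # per Trial (4, 5), per Condition (PC, PM); not the experiment's data.
    data = pd.DataFrame({
        "subject": [s for s in range(1, 5) for _ in range(4)],
        "trial": [4, 4, 5, 5] * 4,
        "condition": ["PC", "PM", "PC", "PM"] * 4,
        "rt": [830, 745, 860, 690, 815, 760, 870, 700,
               840, 735, 855, 685, 825, 750, 868, 695],
    })

    # Two-way repeated-measures ANOVA: Trial and Condition main effects
    # plus their interaction.
    result = AnovaRM(data, depvar="rt", subject="subject",
                     within=["trial", "condition"]).fit()
    print(result)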
Inferential Failures
Inferential failure detection followed the predicted pattern, although the differences were only marginally significant (p < .1). There was a marginally significant
main effect for Condition (PC vs. PM; 2.16 vs. 2.23), [F(1,14) = 3.39, p < .1], but no main
effect of Trial (4 vs. 5; 2.21 vs. 2.23). There was also a marginally significant Trial by
Condition interaction [F(1,14) = 3.74, p < .1], as post-controllers improved from Trial 4 to
Trial 5, but subjects’ performance in the post-monitoring condition worsened from Trial 4
to Trial 5.
Planned comparisons were in the directions predicted by the hypothesis. There were no significant differences within
Condition PC comparing Trial 4 versus 5 (2.26 vs. 2.06, p = .13), nor in Condition PM
(2.16 vs. 2.31, p = .2), although the trend suggested by these data is interesting and will be
discussed further. There was no significant difference by Condition (PM and PC) for Trial
4 (2.26 vs. 2.16). However, there was a marginally significant difference by Condition for
Trial 5 (2.06 vs. 2.31), [F(1,14) = 3.21, p < .1] as shown in Figure 12. Because of the
marginally significant results using the combined index, RT and error rate were analyzed
separately for Trial 5. While there was no significant difference for RT (PC vs. PM; 8674
vs. 8839), there was a marginally significant difference for error rate (.2 vs. .29).
Figure 12. Inverse relationship of Inferred and Signaled failure detection performance, Trials 4 and 5, Conditions Post-Control and Post-Monitor.
Because the design was counter-balanced between Days 5 and 6, the possibility existed that
the benefit of controlling on Day 5 would persist beyond the succeeding trials and perhaps into the next day. This effect would thus prevent a controller advantage from
appearing in the data since the comparison trials for Day 5 controllers were Trials 4 and 5
on Day 6. Therefore, a separate analysis was conducted using only Day 6 controllers
(whose comparison trials were from Day 5 and thus untainted by prior controlling).
While these results are potentially confounded by Day, they did support the primary
hypothesis that controlling benefits subsequent monitoring performance, but also the
hypothesis that this benefit may be of considerable duration. There was a significant
difference for condition (PC vs. PM), showing an advantage for post-controllers in the
combined measure for inferential failures (1.8 vs. 2.31), [F(1,7) = 6.08, p < .05].
Additionally, the same comparison for error rate yielded a significant difference, (.21 vs.
.5), [F(1,7) = 7.98, p < .05]. There were no other significant differences. However, the
signaled and inferred RT means were all in the same directions as the data from the
analysis when both groups were used (Day 5 and Day 6 controllers).
Discussion
The results of Experiment 3 support the hypothesis that periodic controlling can improve
subsequent monitoring performance and, importantly, increase the external validity of this line of research. Previous demonstrations of the controller advantage came from subjects who previously monitored and controlled using tracking tasks (Kessel & Wickens, 1982; Young, 1995). The current results, obtained with a cognitively complex dynamic task, strongly suggest that controlling a system makes one more sensitive to dynamic features of the system and thus more sensitive to system failures. While previous designs trained monitors without any controlling experience, in most operational environments in which operators monitor dynamic systems, operators’ training includes manual control of the system.
This validity concern is especially acute in aviation environments where operators only monitor the system after years of controlling the system manually. The design used in Experiment 3, however, demonstrates
that individuals with considerable hands-on manual control experience, then subsequent
monitoring exposure, will benefit from periodic reintroduction to controlling. While the
specific results showing improved inferred failure detection were marginally significant in
Trial 5, this view is supported by Parasuraman, Mouloua, and Molloy (1996), who found
that monitoring performance was superior after a ten minute period in which some of the
previously automated tasks were returned to operator control. Although the anticipated
results were not present in Trial 4, the abrupt transition from controlling to monitoring was likely responsible.
Signaled Failures
The differentiation between signaled and inferred failures was originally developed to separate failures that can be detected without any system understanding from those requiring a current state of system knowledge (see Experiment 1 for details). Since my theory states that controlling should yield a more “activated” mental model for the operator or, more precisely, an activated, current-state-based mental model of the system, it was hypothesized that inferential
failures would be detected more easily when the operator’s mental model was in its dynamic
activation state. However, signaled failures, which required the operator to respond to
simple stimuli, should remain unaffected by mental model activation because effective
analysis of system behavior yields no advantage for the detection of a signaled failure. In
essence, an individual could have no understanding of system operation, yet be perfectly capable of detecting a signaled failure.
Results from Experiment 3 yielded the finding that past controllers, who presumably
benefited from a more activated mental model, were poorer at detecting signaled failures.
The likely explanation is that one aspect of an activated mental model is that subjects spend
more time scanning for vital information on the display (e.g., throttle information in the
periphery of the display), and thus less time focused in the center of the display where the
signaled failures occurred. While this phenomenon was not predicted, it is consistent with
the view that a proficient mental model guides perceptual activity to those features of the system most relevant to the task (Endsley, 1995).

This same signaled/inferred failure detection trade-off occurred in Experiment 1, and led to the Experiment 2 manipulation in which the throttle was removed from the display, eliminating this information on half of the trials. If past controllers’ signaled failure detection disadvantage
was a result of more time spent looking at throttle information, then removal of the throttle
should alter this signaled failure detection deficit. Results from Experiment 2 were surprising in that past-controllers were faster at detecting signaled failures over both conditions (discussed in detail in Discussion 2). However, with past-controllers
there was a significant difference in the signaled failure detection performance between the
throttle Visible and throttle NotVisible conditions, with poorer performance occurring in
the presence of the throttle display. There was no such effect for the past-monitors. This
finding supports the speculation that the presence of the throttle, and consequently the
subject’s attention to it, has a negative effect on signaled failure detection performance.
Further, it supports the contention that the past-controllers paid more attention to the throttle information than did the past-monitors.
Because the relationship between signaled failure detection performance and the post-controlling condition was established in Experiment 2, a prediction was made for this relationship for Experiment 3. It should be noted that this effect would have
little operational significance, since even when significant the response time differences for
signaled failures were relatively small (e.g., 667 ms vs. 604 ms, from Experiment 2). Rather, its significance is theoretical. As predicted, signaled failure detection performance in the post-controlling condition was significantly worse than in the post-monitoring condition, while inferential failure detection performance was better, although only marginally significant, for the post-controllers. I believe that there are three
possible explanations for the apparent trade-off between signaled and inferential failure
detection performances. All likely explanations originate from the central point that
controlling makes a subject spend more time focused on throttle information and less time
focused on the center of the display where signaled failures occur. Because subjects do
focus more attention on the throttle, it is presumed that this information, at least in part, is
The first explanation for this trade-off in inferred and signaled failure detection
performance is based on the fact that subjects must allocate more resources to the throttle
portion of the task while controlling because one of their controlling tasks is throttle
management. This task requires that subjects monitor the throttle display quite diligently
(subjects were told on the first day of the experiment that their bonus would be partially
determined by how well they managed the throttle on controlling trials) and to use the joystick to make throttle adjustments. It is possible that the act of performing the task simply reinforces a scanning pattern which
incorporates the throttle. This explanation implies that when subjects return to the
monitoring task, their scanning behavior incorporates the throttle not because of increased
perceptual sensitivity to system attributes nor because a more activated mental model is guiding perception, but simply because scanning the throttle has become an unconscious habit which, after controlling the system, happens to result in less scan time in
the center of the display and, therefore, poorer signaled detection performance. This
explanation, however, is not supported by the results. If a habit change was responsible for
the effect, then one would expect the strongest effect to occur directly after the controlling
condition, then weaken as subjects adapted to the monitoring task. However, this effect was
only present in Trial 5 of Experiment 3 when it should have been weakening. The
monitoring trial directly after the controlling re-introduction, Trial 4, showed no difference between the conditions. Further, if the scanning differences were habitual rather than cognitively driven, it seems unlikely that
there would be a resulting pay-off in inferential failure detection, although this is more
difficult to verify.
The second explanation for the change in scanning behavior is that the relationship
between throttle activity and overall fuel system behavior is strengthened when subjects are
forced to manipulate the throttle level by hand in the manual mode. This explanation suggests that subjects shift their attention to the portion of the task they perceive as having the greatest pay-off in terms of failure detection. This
shift, however, must be involuntary since subjects are instructed to detect failures to the best
of their ability in every condition. There is, therefore, no valid reason for subjects to
intentionally select a less effective strategy. While this shift may be characterized as a form of complacency (Parasuraman, et al., 1993), it is difficult to understand its origin. Perhaps the forced shift away from the throttle during extended monitoring produces a scan pattern which, unbeknownst to the subjects, has a harmful effect on their inferential failure
detection performance.
The final explanation for this trade-off, and the explanation most consistent with the
hypotheses, is that the reintroduction of controlling both the throttle and fuel pumps has the
effect of strengthening, or re-activating subjects’ mental models of the system. The effect of
this heightened system understanding, and the resultant increased sensitivity to system
operation, is that subjects pay greater attention to throttle activity and benefit from the
information it provides. Further, this explanation is consistent with Endsley’s (1995) view
that a good mental model guides perceptual activity to relevant cues. This view argues that the perceptual
process is generally outside of conscious awareness, and anecdotal evidence from subjects’
post-experiment comments suggests they were unaware that their attention to the throttle had changed.
Inferential Failures
Inferential failure detection performance was in the pattern predicted by the hypotheses, although the differences were only marginally significant. In the post-controlling condition, subjects were better at detecting inferential failures than when they had been
continuously monitoring. While the pattern between Trials 4 and 5 (the two comparison
trials) was not predicted, a marginally significant difference between groups occurred on
the fifth trial. Performance differences between the two conditions on Trial 4 were not
significant, and suggest that there is a transitional period as subjects transfer from a
controlling to a monitoring mode. This is not surprising given the large differences
between the two tasks, but is likely a factor in need of further study before controlling re-introduction is applied operationally.

More importantly, however, I believe the fact that post-controllers improved from Trial 4 to Trial 5, while performance after continued monitoring declined, is itself significant. This is especially true given that the Experiment 3 pilot study data suggest that
subjects’ performances reached a negative asymptote by Trial 4 and remained poor through
Trial 5. This suggests that periodic controller re-introduction may have the effect of
“resetting the clock” on the deterioration of monitoring performance. The effect presumes
that controlling the system is a considerably different task than monitoring the system, as is
the case in this paradigm (while the objective is the same, the subjects’ activities between
the two tasks are quite different). However, in most operational settings, the difference between controlling and monitoring may not be as great.
Further, the “resetting the clock” concept is quite consistent with the theory that controller
re-introduction has the effect of re-activating the operator’s mental model of the system,
thus shifting the state of the operator’s model away from the static state and towards the
dynamic mental model state. If mental models have a state of activation, as proposed in
this theory, then it is likely that there must be some decay of this activation. Viewed
another way, a dynamic mental model can only remain dynamic, and thus provide
perceptual and computational benefits, for a certain period of time after the features of the
task supporting the dynamic activation cease. While this issue is not directly addressed by
this research (other than at a speculative level), it is another element of the theory which
needs further exploration. Systematic exploration of dynamic mental model decay could
provide critical information for the use of periodic controlling to ward off the negative effects of extended monitoring. It is extremely unlikely that commercial aviation would ever return to an exclusive manual
control environment. However, controlling might be used for short periods of time to
produce the desired effect. It is critical, therefore, that the exact duration of the positive effect of controller re-introduction be determined.
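Although no functional form is implied by these data, one speculative way to formalize such decay, purely for illustration, is an exponential with a task-dependent time constant:

    import math

    def activation(t_seconds, a0=1.0, tau=600.0):
        """Speculative exponential decay of mental-model activation after
        controlling ceases; a0 and tau are illustrative, not estimated values."""
        return a0 * math.exp(-t_seconds / tau)

Under this sketch, periodic controller re-introduction amounts to resetting the activation toward a0 before it decays below the level needed to support effective monitoring.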
It is unfortunate that the inferential failure detection differences were only marginally
significant, but it should be noted that the design of this experiment produced an extremely
conservative test of the hypothesis. Young (1995), using a two dimensional tracking task,
found that subjects who controlled the system in the training portion of the transfer-of-
training experiment without exposure to failures did not show improved failure detection
performance during the monitoring portion of the experiment. This led Young (1995) to
conclude that controlling with failures, rather than just controlling, was responsible for the observed transfer advantage. This conclusion implies that, in order to derive benefit from controlling, the specific failures potentially encountered during monitoring must also be experienced while controlling. In Experiment 3, no inferential failures were presented during the controlling re-introduction, both to achieve maximum external validity, and because it would be the most stringent test of the
theory that mental model re-activation was responsible for post-controlling inferential
failure detection performance, and not a result of specific failure type sensitivity.
The amount of time subjects spent controlling during controller re-introduction was
somewhat arbitrary. Trials 2 and 3 were chosen because they avoided the use of Trial 1
(the trial shown to be significantly different from the other trials in the Experiment 3 pilot
study), yet still allowed two trials for comparison at the end of the session. Given the
transitional factors likely affecting the fourth trial, and the fact that some subjects stated in
post-experiment interviews that they were caught off guard by the re-introduction of
controlling, it is likely that some subjects only had solid controlling experience in the third
trial. Each trial lasted 12 minutes, giving subjects 24 minutes or less of controlling
experience depending on how quickly they recognized and adjusted to the change in
participatory mode. Given the relative difficulty of the controlling task, the improvement in
inferential failure detection performance may have been larger if subjects had been afforded
more controlling time. Additionally, given the general downward trend in performance
over trials (as seen in the Experiment 3 pilot study and in Experiment 3 itself), it is likely
that a stronger effect would also have been achieved by increasing the amount of
monitoring time on days five and six when controller re-introduction did not take place.
Another potential factor affecting the results was that subjects did not perform the task at
the same time each day. Subjects were required to participate in the experiment for six consecutive days, but were allowed to participate anytime during the day. While it was originally thought that this would have little effect, several subjects offered comments such as “I sure did a lot better on that experiment in the morning.” While the variance attributed to this factor is unknown, future multi-day experiments should require subjects to participate at the same time each day.
Experiment 3 was a successful test of the hypothesis because: a) signaled failure detection performance worsened after controller re-introduction, b) inferential failure detection performance improved after controller re-introduction, consistent with the mental model hypothesis, and c) both effects were strongest in the fifth trial, suggesting some long-term reversal of the negative effects of continuous monitoring on failure detection. In sum, Experiment 3 demonstrated that periodic controlling within extended monitoring can have positive benefits for monitoring performance using an ecologically
valid design. Further, when Experiment 3 is considered in the context of this research and
other published works on this topic, it becomes increasingly clear that periodic return to
manual control may be one of the best weapons for fighting the negative effects of extended monitoring.
Conclusion
This research adds considerable depth to previous studies in this field showing that manual
control of systems produces better system monitors. It lends support to the notion suggested
by myself and others (Endsley, 1995; Kessel & Wickens, 1982; Parasuraman, et al., 1996)
that the construct of a mental model may be the appropriate mechanistic explanation for the observed controller advantage. While a mental model explanation for psychological phenomena may harbor seemingly excessive complexity, I would argue that in the context of complex cognitive vigilance tasks, it effectively captures the interaction between the operator and system operation, and it may be a complex explanation which best captures this behavior. The use
of both signaled and inferred failures in this paradigm was novel and effective in
differentiating between vigilance decrements and deficits in the level of activation of the
operator’s mental model. To my knowledge, these experiments are the first to use failures
requiring different levels of cognitive processing for their detection in a single complex
vigilance task.
The first objective of this research was to extend findings that past controllers make better monitors to a cognitively complex dynamic task. While Experiment 2 replicated the basic findings that controllers make better monitors, the use of the throttle mechanism as both a separate controlling task and as an important information source provided new insight into the mechanism behind this advantage. It appears that operators who are trained by controlling a system develop a higher level of understanding of, and sensitivity to, the dynamics of the system they control. When those subjects are then placed in a monitoring condition, their more
comprehensive understanding allows them greater acuity for important system behaviors,
and they use that information to effectively detect failures. Not only is this finding
theoretically significant in its own right, but it supports contentions by Moray (1986) that
system monitors must be trained in a manual control mode if they are expected to effectively detect system failures.

The fact that past-controllers appear to scan important features of the display for system information may help explain several aviation incidents in which errors were made on the aircraft’s FMS, yet pilots failed to observe the aircraft’s unintended
behavior. In each of these occurrences, ample evidence was available on the displays, yet
the pilots failed to perceive this information and process it to a level which should have
signaled the existence of a serious problem. In effect, because the pilots had been
monitoring for extended periods of time before these incidents, their perceptual activity was
blinded because system monitoring failed to require perception of these system variables,
and their perceptual cycle became derailed, at least in relation to the primary goal. In
essence, the inactivity of monitoring yielded a weak dynamic execution of their flying
mental model so that access and understanding of subtle system behavior and the
consequent perceptual activity were severely affected. While this is speculation, I believe
that it is the best explanation to date as to why experienced pilots failed to perceive a developing, serious problem.
This finding also supports the contention by Smolensky (1993) that the notion of situational
awareness may be related to certain physiological attributes. In fact, this finding strongly
supports the view that ocular movement may be a strong predictor of one’s situational awareness. High situational awareness likely reflects a highly activated operator’s mental model of a task, and thus highly efficient perceptual activity as the operator updates and integrates information pertaining to the task. I believe that this perceptual activity should have a strong effect on one’s ocular movement, and is likely to be measurable through eye-movement recording.
The purpose of Experiment 3 was to use the ecologically valid task of the first two experiments in a design more representative of commercial aviation operations. The new design allowed all subjects to learn and perfect the task in a
controlling mode, as is the case in the aviation domain. Subjects then monitored for several
days, and after extensive monitoring, they were momentarily re-introduced to the
controlling task. Even this multi-day design compresses time compared to most operational
settings, but it is more realistic than previous research and external validity is increased by
insuring that all subjects are trained in the same hands-on manner. Results from this
experiment showed that even a 24-minute controller re-introduction can have a positive effect on subsequent inferred failure detection performance, while signaled failure detection performance was significantly worse after the controlling re-introduction.
The combination of improved inferred and poorer signaled failure detection performance
implies that even a short period of manual control within an extended period of monitoring
can cause subjects to return to a more effective pattern of scanning, while perceiving and
integrating system information more effectively. Further, controlling seemed to have the
effect of “resetting the clock” so that after a true 50 minutes of system exposure subjects
were performing as if they had just started the task, even though in previous experiments performance had declined steadily by this point in the session.
The results of Experiment 3 are the strongest evidence yet that periodic controller
reintroduction may be the best tool for airlines and other monitoring-intensive operations to
fight detrimental “out-of-the-loop” performance effects (Endsley, 1995). While the various
perspectives on this problem were outlined in the introduction of this dissertation, they
have generated few concrete solutions. The "controlling solution," however, appears not only
to be effective but also easily implemented. In fact, the only cost seems to be the slight
loss in operational efficiency that occurs when human operators take control for a period of
time. Several directions for future work remain. First, these findings should be replicated
in a realistic full mission simulator using commercial pilots. Second, more experimentation
needs to take place regarding the relationship between the length of the controlling period
and the amount of resulting benefit. It seems quite likely that the law of diminishing
returns would apply to
controller re-introduction, but that point cannot be determined without further
experimentation. In the same vein, it is also important to know the rate of decay of the
operator's dynamic execution of the mental model, assuming the pilot hand-flies the aircraft
at the beginning of the mission and only later is relegated to system monitor. This rate is
likely to be highly task dependent, ranging from several minutes to several days. In fact,
it seems likely that the decay varies along two dimensions: one being task complexity, the
other the extent to which the task is motor versus cognitive. While these questions will be
time-consuming to answer, they will certainly yield vital information for operational
practice.
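One speculative way to formalize this two-dimensional decay is sketched below: an
exponential decay of mental-model activation whose rate is assumed to increase with task
complexity and to decrease with the task's motor component. The functional form, the
direction of both effects, and every constant are assumptions awaiting exactly the
experimentation called for above.

    import math

    # Speculative formalization of the proposed two-dimensional decay.
    # Exponential form and all constants are illustrative assumptions,
    # not estimates from the present data.

    def activation(t_minutes, complexity, motor_fraction,
                   base_rate=0.01, k_complexity=0.5, k_motor=2.0):
        """Residual mental-model activation in (0, 1] after t_minutes
        without manual control.

        complexity:     task complexity in [0, 1]; assumed here to speed decay.
        motor_fraction: motor (vs. cognitive) share of the task in [0, 1];
                        motor skill is assumed to persist longer, slowing decay.
        """
        rate = base_rate * (1 + k_complexity * complexity) \
               / (1 + k_motor * motor_fraction)
        return math.exp(-rate * t_minutes)

    # A complex, mostly cognitive task vs. a simple, mostly motor one:
    print(activation(120, complexity=0.9, motor_fraction=0.1))  # decays quickly
    print(activation(120, complexity=0.2, motor_fraction=0.9))  # decays slowly

Fitting such a surface to transfer data from tasks that differ in complexity and in motor
demand would indicate how often controller re-introduction is needed in a given operational
setting.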
My goal with this research has been to extend and refine past results showing the benefits
of controlling, and to guide this line of research in a direction most beneficial for
commercial aviation and other industrial tasks that rely on continuous monitoring. I believe
this line of work shows that while the "out-of-the-loop" performance problem is both real
and serious, potential solutions are available. Further, unlike most solutions to serious
problems, where the benefit only slightly outweighs the cost, this "controlling solution"
appears to offer substantial benefit at very little cost.
REFERENCES
Adams, M. J., Tenney, Y. J., & Pew, R. W. (1995). Situation awareness and
the cognitive management of complex systems. Human Factors, 37(1), 85-104.
Comstock, J. R., & Arnegard, R. J. (1992). The multi-attribute task battery for
human operator workload and strategic behavior research (Tech. Memorandum 104174).
Hampton, VA: NASA Langley Research Center.
Confusion over flight mode may have role in A320 crash. (1992, Feb. 3).
Aviation Week & Space Technology, p.29.
Covey, R. R., Mascetti, G. J., Roessler, W. U., & Bowles, R. (1979, December).
Operational energy conservation strategies. Proceedings of the Institute of Electrical and
Electronic Engineers Conference on Decision and Control. Ft. Lauderdale.
Crash triggers review of AMR. (1996, January 1). Aviation Week & Space
Technology, p.30.
Indian A320 crash probe data show crew improperly configured aircraft. (1990,
June 25). Aviation Week & Space Technology, p.84.
Jagacinski, R. J., & Miller, R. A. (1978). Describing the human operator's internal
model of a dynamic system. Human Factors, 20, 425-439.
Johannsen, G., Pfendler, C., & Stein, W. (1976). Human performance and
workload in simulated landing approaches with autopilot-failures. In T. B. Sheridan and
G. Johannsen (Eds.), Monitoring and Supervisory Control. New York: Plenum.
Kieras, D., & Bovair, S. (1984). The role of a mental model in learning to operate
a device. Cognitive Science, 8, 255-273.
Reprinted in H. Sinaiko (Ed.), Selected papers on human factors in the design and use of
control systems. New York: Dover Publications, Inc., 1960.
Norman, S., Billings, C. E., Nagel, D., Palmer, E., Wiener, E. L., & Woods, D. D.
(1988). Aircraft automation philosophy: A source document. Flight deck automation:
Promises and realities, [Workshop manual]. NASA Ames Research Center: Moffett Field.
Parasuraman, R., Mouloua, M., & Molloy, R. (1996). Effects of adaptive task
allocation on monitoring of automated systems. Human Factors, 38(4), 665-679.
Parasuraman, R., Molloy, R., & Singh, I. L. (1993). Performance consequences of
automation-induced "complacency." International Journal of Aviation Psychology, 3(1),
1-23.
Sarter, N. B., & Woods, D. D. (1995). How in the world did we ever get into that
mode? Mode error and awareness in supervisory control. Human Factors, 37(1), 5-19.
Sarter, N. B., & Woods, D. D. (1992). Pilot interaction with cockpit automation:
Operational experiences with the flight management system. The International Journal of
Aviation Psychology, 2(1), 303-322.
Sarter, N. B., & Woods, D. D. (1991). Situation awareness: A critical but ill-
defined phenomenon. The International Journal of Aviation Psychology, 1(1), 45-57.
Sekigawa, E., & Mecham, M. (1996, July 29). Pilots, A300 systems cited in
Nagoya crash. Aviation Week & Space Technology, 36-37.
Thackray, R. I., & Touchstone, R. M. (1989). Detection efficiency on an air
traffic control monitoring task with and without computer aiding. Aviation, Space and
Environmental Medicine, 60, 744-748.
Van Cott, H. P., Wiener, E. L., Wickens, C. D., Blackman, H. S., & Sheridan, T.
B. (1996, October). Smart automation enhances safety: A motion for debate. Ergonomics
in Design, 4(4), 19-23.
Wickens, C. D., & Kessel, C. (1979). The effects of participatory mode and task
workload on the detection of dynamic system failures. IEEE Transactions on Systems,
Man, and Cybernetics, SMC-9(1), 24-34.
Wiener, E. L. (1993). Life in the second decade of the glass cockpit. Proceedings
of the Seventh International Symposium on Aviation Psychology, 1-11.
Wiener, E. L. (1985). Cockpit automation: In need of a philosophy (SAE Tech.
paper 851956). Washington, D.C.
Wiener, E. L., & Curry, R. E. (1980). Flight deck automation: Promises and
problems. Ergonomics, 23(10), 995-1011.
Williams, M. D., Hollan, J. D., & Stevens, A. L. (1983). Human reasoning about
a simple physical system. In D. Gentner & A. Stevens (Eds.), Mental models (pp. 131-
153). Hillsdale: Erlbaum.
APPENDIX A: Experimental task.
APPENDIX B: Experiment 1 Inferred failure RT and error rate.
                         Session 1            Session 2
                         Day 1                Day 2 (monitoring)
Controllers              Control              Auto-pilot
                                              Yoked
"Auto-pilot" monitors    Auto-pilot           Auto-pilot
                                              Yoked
"Yoked" monitors         Yoked                Auto-pilot
                                              Yoked
Experiment 1 participatory modes.
                         Session 1            Session 2
                         Day 1                Day 2 (monitoring)
Controllers              .51/8171             Auto-pilot: .53/8963
                                              Yoked: .49/8660
"Auto-pilot" monitors    .77/10506            Auto-pilot: .7/8116
                                              Yoked: .62/10401
"Yoked" monitors         .74/8389             Auto-pilot: .76/8911
                                              Yoked: .58/9026
Experiment 1 inferred failure detection performance, error rate and reaction times
(error rate/RT).
APPENDIX C: Experiment 2 Inferred failure RT and error rate.
                         Session 1                      Session 2
                         Day 1        Day 2             Day 3 (monitoring)
Controllers              Control      Control           Throt Vis
                                                        Throt NotVis
"Yoked" monitors         Monitor      Monitor           Throt Vis
                                                        Throt NotVis
Experiment 2 participatory modes.
                         Session 1                      Session 2
                         Day 1        Day 2             Day 3 (monitoring)
Controllers              .37/8467     .26/8310          Throt Vis: .21/7315
                                                        Throt NotVis: .24/7398
"Yoked" monitors         .44/8401     .33/8098          Throt Vis: .3/7627
                                                        Throt NotVis: .25/7845
Experiment 2 inferred failure detection performance, error rate and reaction times
(error rate/RT).
APPENDIX D: Experiment 3 Inferred failure RT and error rate.
Experiment 3 Inferred failure detection performance, error rate and reaction times. (Error
rate/RT).