August 1997
VITA
PROFESSIONAL SOCIETIES
PUBLICATIONS
Young, G. E. (1995). The Impact of Trial Length and Mode Experience on Failure-
Detection Performance in Monitored and Controlled Dynamic Tasks. Proceedings of
the Eighth International Symposium on Aviation Psychology, 1031-1036.
Berringer, D. B., Allen, R. C., Kozak, K. A., & Young, G. E. (1993). Responses of Pilots
and Non-pilots to Color-coded Altitude Information in a Cockpit Display of Traffic
Information. Proceedings of the Human Factors and Ergonomics Society 37th Annual
Meeting, 84-87.
FIELD OF STUDY
ABSTRACT
Previous research has shown that active controllers can detect failures in a simple dynamic
system faster and more accurately than passive monitors. Further, when controllers transfer
to a monitoring task, they also have better failure detection performance than subjects who
only monitor. This dissertation has two objectives: (a) to replicate previous tracking-task
based findings using a new, cognitively complex dynamic task with failure types which tap
into different cognitive processes, and (b) to use this new task paradigm in an ecologically
valid experimental design to further explore the demonstrated advantages of controlling.
Further, this dissertation advances the contention that the controller/monitor issue should
be conceptualized as a difference in the level of activation of the operator’s mental model of
the system. Results from Experiment 1 fail to replicate past findings of a controller
advantage, but yield the surprising result that past controllers may scan the display more
effectively. Experiment 2 improves upon the basic design of Experiment 1 and makes it
possible to explore the issue of controller versus monitor scan differences in greater depth.
Experiment 2 successfully replicates the controller advantage observed in tracking-task
experiments and supports the conclusion of Experiment 1 that controllers scan the display
more effectively and use the information gained to their advantage. Experiment 3 uses the
same experimental paradigm, but in a design more representative of operational settings.
All subjects in Experiment 3 learned in a controlling mode and then transferred to the
monitoring task. However, subjects were periodically reintroduced to the controlling mode
and its effects on their subsequent monitoring performance were measured. Results
demonstrate that controller reintroduction has a positive effect on monitoring performance.
Implications of these findings for operational environments are discussed in detail.
TABLE OF CONTENTS
INTRODUCTION
    Advantages of Automation
    Peripheralisation
    Situational Awareness
    Mental Models
    Relevant Research
    Present Research
EXPERIMENT 1
    Method
        Subjects
        Apparatus
        Task
        Experimental Design
        Training
    Results
        Signaled Failures
        Inferential Failures
    Discussion
    Experiment 1 Conclusions
EXPERIMENT 2
    Method
        Subjects
        Task
        Experimental Design
        Training
    Results
        Signaled Failures, Session 1
        Signaled Failures, Session 2
        Inferential Failures, Session 1
        Inferential Failures, Session 2
    Discussion
EXPERIMENT 3
    Method
        Subjects
        Task
        Experimental Design Considerations
        Experiment 3 Design
        Training
    Results
        Signaled Failures
        Inferential Failures
        Day 6 Controllers - Separate Analysis
    Discussion
        Signaled Failures
    Conclusion
I know I’m not in the loop, but I’m not exactly out of the loop. It’s more
like I’m flying alongside the loop.
-Anonymous Boeing 767 Captain (Wiener, 1988)
INTRODUCTION
The interaction of pilots and highly automated aircraft has become an increasingly studied
topic in recent years. This interest has been fueled not by the pursuit of academic
enlightenment, but rather by a series of fatal aircraft accidents in which the pilot/auto-pilot
interaction was the primary cause (Billings, 1991). I believe that this precarious order of
events is the result of industry, consumers, and safety advocates alike embracing a
promising technology. While some skeptics warned that the pilot/auto-pilot relationship
might not perform at the consonant level anticipated, such views were largely overshadowed
by enthusiasm for the new technology. In fact, despite the concerns over cockpit
automation, most would agree that an aircraft with a high level of automation is more
efficient, and possibly even safer, than a similar aircraft without it (Billings, 1991).
Although accidents per passenger mile continue on a downward trend, automation-related
problems appear with growing frequency in the probable cause section of accident reports.
The problem, however, is not that automation exists in modern aircraft, but rather that a
lack of foresight has yielded systems that do not always mesh well with their human
operators.
Automation, when used in the aviation domain, is a broad term. It does not refer to a single
device, but rather a class of devices which control the various dynamic processes in an
aircraft ranging from basic mechanical systems to the actual task of “flying” the aircraft.
For purposes of clarity, when "automation" is used in this paper, it refers to the definition
used by Billings (1991): "A system in which many of the processes of production are [...]."
As will be discussed in greater detail, the level of automation in aircraft has been creeping
across the automation continuum for the last 80 years, and has only in the last several
decades become so prominent in the cockpit that it has raised serious concerns.
The question of whether or not to automate civilian transport and military aircraft of all
types is now merely academic (Billings, 1991). Prior to the 1950s the question was "what
can we automate?" Advances in technology soon changed the question to "what should we
automate first?" By the early seventies, such questions were rarely asked as virtually every
component of the cockpit had become, or was on its way to becoming, highly automated. It
has only been since the late 1980s that the question of "what and how much" to automate
has again become an important and serious question. Even proponents acknowledge that
automation has not delivered all of the gains in safety initially anticipated (Van Cott,
Wiener, Wickens, Blackman, & Sheridan, 1996), while skeptics of automation continue to
believe that multi-modal automation may simply be too complex for human operators to
manage safely.
While the future of automation seems to be progressing toward the integration of increased
levels of intelligent automation, the evolution of cockpit automation to this point will be
discussed in detail in the following section. Following that discussion, the various
experimental paradigms and research perspectives which have focused both directly and
indirectly on the issue of cockpit automation will be discussed in detail. These include
vigilance research, peripheralisation, motor-skill factors, the small error/large error trade-
off, automation reliability, flight management system mode complexity, workload levels,
automation induced complacency, situational awareness, and the role of mental models as a
framework for conceptualizing problems with automation. Finally, specific past research
relevant to the present research perspective will be discussed, along with a discussion of
the objectives of the present research.
One need only briefly review the history of both commercial and military aviation to
appreciate the desire of many to reduce the level of human operator control in the cockpit.
Although difficult to quantify, best estimates put the direct contribution of human error to
commercial aviation disasters at approximately 70% (Nagel, 1988), while the role of some
human error as a contributing factor in the chain of events leading to an accident is likely
even higher. This of course does not include fatal mishaps on railroads, ships, automobiles,
and industrial applications, but the percentages are likely similar (Van Cott, Wiener,
Wickens, Blackman, & Sheridan, 1996). Although human error is a complex concept, it
can generally be broken down into the following categories (Woodson, 1981):
1) Perceptual errors: failures in searching for and receiving information, or in correctly
identifying objects.
2) Mediational errors: failures in processing information, or in problem solving and
decision making.
3) Communication errors: failures in exchanging information clearly and correctly.
4) Motor errors: failures to execute simple and complex, discrete and continuous, motor
behaviors correctly.
Automation has, interestingly, in its own duplicitous manner both addressed and
aggravated human error in each of these categories. The purpose of the following
discussion is to explore in detail both the pros and cons of cockpit automation, and the
complex way that it interacts with human error. In fact, this discussion will highlight the
observation that technology has the potential to solve problems while at the same time
creating new ones.
Advantages of Automation
Ample studies of human error in the cockpit have come to the conclusion that a primary,
yet not exclusive cause of human error is excessive workload (Kantowitz & Sorkin, 1983).
Prior to the use of automation in the cockpit, pilots were forced to attend to and manage the
many complex systems in the aircraft (e.g., fuel distribution, engine management, cabin
pressurization, etc.) and fly the aircraft (e.g., manual control, navigation, and
communication). Whereas the first generation of automation provided basic
aircraft control and simplified radio navigation, the second generation of aircraft
automation included the consolidation of displays into integrated displays, the transition
from raw data into more usable command information (e.g., a flight director), and the use
of air data computers to integrate multiple sources of information regarding air density and
direction into usable information both for the pilot and auto-pilot (Billings, 1991). Today,
the third generation of automation sees all complex systems in the aircraft partially or
fully automated, with a Flight Management System integrating and coordinating
all the automated devices on the aircraft (Billings, 1991). If they choose, pilots need only
be involved in commanding the automation through the Flight Management Computer,
and in monitoring the systems in case a failure
occurs. It is for this reason that the majority of aircraft produced today are designed for two
pilots, as compared to four (two pilots, plus a flight engineer and navigator) which was the
case only thirty years ago. In fact, few would argue with the success of the application of
automation to basic systems management; because those tasks were
straightforward, the automation was relatively simple and its execution relatively error free.
Although such automation eventually eliminated the need for a flight engineer altogether, it
did relatively little to ease the workload of the flying pilots since much of the system’s
automation replaced the flight engineer’s duties, but not necessarily the pilots’. Although
flying an aircraft under cruise condition requires relatively low levels of workload, getting
the aircraft from the ground to cruise, and then from cruise to the ground requires
considerable effort on the part of the pilots (Billings, 1991). In addition, a statistical
breakdown of aircraft accidents demonstrates convincingly that these portions of the flight
contain the greatest risk. In fact, 90% of all accidents occur in the climb to, or descent from
cruising condition (Nagel, 1988). This statistic is made more profound by the fact that
these two phases of flight account for less than 40% of the flight time (Nagel, 1988). In
fact, accidents during cruise account for less than 9% of all aircraft accidents, yet cruise
flight accounts for 60% of flight time. Not only must pilots communicate with air traffic
control (ATC) and navigate to the correct location, but they must maintain control of the
aircraft in the desired attitude, altitude, vector, and velocity. Although this task is not
especially demanding in cruise, it becomes considerably more so in the climb and descent
phases of flight due to the presence of hostile weather, aircraft traffic, and frequent ATC
instructions.
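To make the exposure comparison explicit, a rough accident density per unit of flight time
can be computed from the Nagel (1988) figures above (a back-of-the-envelope calculation,
not one reported in the source):

\[
\frac{0.90 / 0.40}{0.09 / 0.60} \;\approx\; \frac{2.25}{0.15} \;=\; 15
\]

That is, the climb and descent phases carry roughly fifteen times the accident risk per hour
of exposure that cruise flight does.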
Given the complexity of the flying task in these phases of flight, the increased risk of
mishap, and the strong correlation between pilot workload and pilot error (Kantowitz &
Sorkin, 1983), tremendous effort was put into the design of Flight Management Systems
(FMS) which, through computerized command of navigation and aircraft control, ease the
burden of controlling the aircraft in these critical phases of flight. Such automation, in
theory, eliminates many time- and resource-consuming tasks which contribute to pilot
workload and error. The pilot's attention could then be directed to monitoring mission
progress and overall system status, rather
than burdening a pilot's cognitive resources with command and control processing.
Further, should a partial or complete failure occur with the automation, the pilot could
quickly and effectively diagnose and re-engage at the point where the auto-pilot
relinquished authority.
If one were to tour the cockpit of a modern airliner, one would find a Flight Management
System (FMS) which not only has the capacity to successfully control and navigate the
aircraft through descent and ascent, but can fly the aircraft from takeoff to taxi at the
destination without a single pilot intervention on the aircraft controls (Billings, 1991). In
fact, until recently, it had been the operational policy of air carriers to encourage their pilots
to use their FMS to its fullest capacity, leaving the pilots with the duty of high-level
supervision and decision making.
The other compelling reason for the development of the FMS, besides the belief that the
human's limited capacity for workload was the primary barrier to aircraft safety, was the
acknowledgment that human inner loop control precision was very limited (Billings, 1991).
Not only is the task of precise control tedious and perceptually demanding, but the high
control error levels of human operators mean considerable loss in efficiency. In fact,
microprocessor control of flight allows all flight phases and transitions to be accomplished
at maximal efficiency. Not only do human pilots lack the specific knowledge of how to
execute control maneuvers with perfect efficiency but, even with this knowledge, their
control accuracy is inadequate. The use of Flight Management Systems has thus introduced
marked improvements in fuel and engine efficiency. In fact, Covey et al. (1979) suggested
that a 12% savings in fuel was possible (through, among other measures,
changes to aircraft systems), with much of this gain coming through the use of automation.
Another study cited by Wiener and Curry (1980) suggested that a three percent reduction in
fuel consumption could result in a 26% increase in airline profits. Fueling this drive
toward efficiency was also the fact that the price of a gallon of jet fuel went from 38 cents in
1978 to 70 cents in 1979 (Wiener & Curry, 1980), and went above a dollar in the 1980s
where it remains today. Improved efficiency clearly increases the profitability of airlines,
increases the need for new aircraft, lowers ticket prices, and reduces environmental impact.
Further, by lowering the cost of flying to the general public, overall transportation safety is
theoretically enhanced by moving people into air travel and away from more dangerous
modes of transportation.
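The leverage of a small fuel saving on profit follows from the thin margins of airline
operation. As an illustration only (the cost shares here are assumed, not taken from Wiener
and Curry, 1980): if fuel represents roughly a quarter of total operating cost C, and profit P
is roughly 3% of C, then a 3% reduction in fuel burn yields

\[
\frac{\Delta P}{P} \;=\; \frac{0.03\, f\, C}{P} \;\approx\; \frac{0.03 \times 0.25}{0.03} \;=\; 0.25,
\]

on the order of the 26% profit increase cited above, where f is fuel's share of operating cost.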
The final reason for the push towards automation was the need to address specific human
error induced safety concerns such as controlled flight into terrain and air-to-air collisions.
In accidents such as these, it was often clear that sufficient information was present so that
given prompt and accurate interpretation of the information, such disasters could be
avoided (Billings, 1991). Computerized systems were thus developed to deal effectively
with these specific classes of error.
A good example of this automation is the Ground Proximity Warning System mandated by
Congress in 1975 to address a series of “controlled flight into terrain” incidents (Wiener &
Curry, 1981). This simple form of automation combines radar and barometric altimetry to
calculate height above ground and rate of change, therefore predicting when a possible
unintended conflict with the ground might occur (Billings, 1991). Such automation is
advisory only, thus leaving ultimate command authority to the pilot. Other examples of
“problem specific” automation include devices which force the control column of an aircraft
forward (known as a “stick pusher”) to avert an aerodynamic “stall,” and the Traffic Alert
and Collision Avoidance System (TCAS), which receives transponder signals from other
aircraft and displays them in relation to one’s own aircraft thus warning of potential
conflict. There are many other such systems in modern aircraft and most would agree that
current “problem specific” automation has been quite successful, despite the common
appearance of problems when these systems are first instituted (Billings, 1991).
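The underlying logic of such a ground-proximity alert can be shown with a brief sketch. The
code below is a simplified illustration of the principle described above (projecting time to
ground impact from height and sink rate); the function name, threshold, and structure are
this author's assumptions for illustration, not the actual GPWS implementation.

    def ground_proximity_advisory(radio_height_ft, sink_rate_fpm, warning_time_s=30.0):
        """Illustrative only: warn when projected time to ground impact is short.

        radio_height_ft: height above ground, from radar altimetry
        sink_rate_fpm:   rate of descent, from barometric altimetry (positive = descending)
        """
        if sink_rate_fpm <= 0:
            return None  # level or climbing: no predicted ground conflict
        time_to_ground_s = radio_height_ft / (sink_rate_fpm / 60.0)
        if time_to_ground_s < warning_time_s:
            return "PULL UP"  # advisory only; command authority stays with the pilot
        return None

    # Example: 1,000 ft above ground while descending at 3,300 ft/min gives
    # roughly 18 seconds to impact and would trigger the advisory.

The design point is that the system only advises; as noted above, ultimate command
authority remains with the pilot.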
Although the previous section highlighted the positive aspects of increased automation, this
evolution has fostered considerable controversy and numerous disasters. Some highly
visible incidents have illuminated the fact that the transition to automation has not been
problem free. I believe that careful analysis of the problems with aircraft automation shows
that many were foreseeable from existing knowledge of human-machine systems. These
problems, combined with other unpredicted problems discovered through the analysis of
accidents and incidents and data from the Aviation Safety Reporting System (ASRS), raise
the following questions:
1. Given ample evidence of the poor monitoring ability of humans, can pilots be trusted to
monitor highly reliable automation effectively?
3. Does automation cause a degradation of motor skills which will impair pilots when, by
choice or necessity, they must resume manual control?
4. Does automation eliminate frequent small human errors but give way to infrequent
serious errors?
6. Can human pilots effectively program and monitor the "multi-modal" Flight
Management Systems which have many programmable flight modes, and can switch modes
without pilot intervention?
7. Does automation really reduce pilot workload, or has the workload remained the same
while shifting from manual to cognitive demands?
8. Does the reliability of automated systems cause complacency in the cockpit which has an
adverse effect on monitoring performance?
9. Does continuous monitoring by pilots cause a loss of "situation awareness" which could
impair performance when manual intervention becomes necessary?
10. Do pilots have “mental models” of the flying task which may be adversely affected by
“flying the automation” rather than flying the aircraft, and thus preventing effective system
monitoring?
These potential problems with automation are clearly not independent, and any
theorized or observed problem with the pilot-automation interface is likely causally related
to several of these factors. The following section discusses each of these factors in greater
detail.
Research on the ability of individuals to maintain effective sustained attention to real time
processes was first studied in earnest in World War II (Parasuraman, 1986), although some
concern can be traced to early questions about inspectors’ abilities to detect assembly line
defects (Wiener, 1984). The advent of radar produced the need for human operators to
monitor this new technology and efficiently detect enemy threats in a highly monotonous
task with few signals. Through a series of field experiments on both sides of the Atlantic, it
quickly became clear that the fragile nature of human monitoring performance meant that
normal working schedules were inappropriate for sustained attention tasks (Wiener, 1984).
In fact, early field studies by the RAF Coastal Command suggested that radar monitoring
periods be kept short, while recommendations from the United States suggested that
monitoring periods not exceed 40 minutes (Wiener, 1984). Research led by Mackworth
(1950) verified in the laboratory what the military had observed in the field and coined the
term "vigilance decrement" to describe
this phenomenon.
The vigilance decrement referred to the fact that after a given period of sustained attention,
human operators lost their ability to effectively discriminate signals which they could
otherwise detect (Parasuraman, 1986). Although the causes of the vigilance decrement are
complex and still debated, its existence in simple sustained attention tasks is unquestioned.
Fortunately, remedies for this problem were unusually simple. Once the onset of the
vigilance decrement was established, a work shift routine was designed so that no operator
was monitoring past the period of full vigilance, and each operator was given enough of a
rest period to recover before returning to the task.
The success in early vigilance research was due in large part to the fact that the actual task
of observing radar displays lent itself well to the design of experimental tasks for laboratory
research (Parasuraman, 1986), although a minority criticized the early research for its
artificially high signal rates (Wiener, 1984). This convenient and uncommon
circumstance, combined with parallel findings in field research, meant that the early
vigilance research was widely accepted and did not suffer the typical validity issues
associated with laboratory research. This was not the case, however, when researchers tried
to apply vigilance research findings to other, usually more complex, sustained attention
tasks. Not only was the majority of vigilance research conducted in the laboratory focused
on extremely simple, low arousal tasks, but when more complex paradigms were used, the
results were inconsistent.
Early vigilance research in which more complex paradigms were used sometimes found no
decrement at all, suggesting that the vigilance decrement was a
laboratory phenomenon not applicable to complex real world tasks (Parasuraman, 1986).
One predominant view of complex task vigilance was based on research conducted by
Adams et al. (1961). This research used a simulated air defense task in which the number
of non-signal targets was either 6 or 36. Although overall detection performance was worse
when the non-signal targets were more abundant, performance did not change with
increases in time spent at the task. This finding led Adams et al. (1962) and others (as
cited in Parasuraman, 1986) to believe that complex tasks yielded sufficient arousal to
prevent a vigilance decrement. Still other studies, however, demonstrated a strong
vigilance decrement. These studies include a three-clock version of the Mackworth clock
task.
Several theories have been offered for the disparity of results in complex task vigilance
research. Most importantly, the tasks and procedures vary widely and thus make it difficult
to compare results across studies.
Adams et al. (1961) suggested that because of large individual differences in complex task
performance, slight vigilance decrements may exist but fail to reach statistical significance.
In one such study, variability in detection rate for a dual-source visual discrimination task
was nearly twice
that for a single source task. Another explanation contends that since complex task
performance is already poor, there is little opportunity for it to get worse with time (Davies
& Tune, 1969). Additionally, it has been proposed that when a complex-task vigilance
decrement exists, it may be only a slight decrement in sensitivity, thus having only slight
practical consequences.
Only a brief review of complex task vigilance research is necessary to appreciate the
historical difficulty in finding a consistent and robust vigilance decrement in complex tasks.
Although some may see this as an implication of a weak if not irrelevant phenomenon,
others have used these diverse findings to support the contention that a vigilance decrement
can occur in complex tasks under certain conditions (Parasuraman, 1986). Regardless,
however, there is little doubt that the diversity of complex tasks studied in the laboratory
has not been valid enough to generalize findings to the operational environment. Thus, it
has been proposed that any theoretical findings from laboratory settings must be tested in
parallel, highly realistic paradigms to ensure ecological validity (Satchell, 1993). While an
undertaking such as this may be unrealistic, it would likely quell some of the debate over
the generality of complex-task vigilance findings.
Peripheralisation
The term peripheralisation has been used to describe the process of role change that pilots
experience as they become increasingly distanced from the essential flight process as levels
of automation increase (Billings, 1991; Norman, Billings, Nagel, Palmer, Wiener, &
Woods, 1988). The peripheralisation process stems partially from the failure of aircraft
designers to focus on human needs in an "out of the loop" control environment (Wiener &
Curry, 1980), but it also reflects the nature of the monitoring role itself.
Satchell (1993) has organized the effects of peripheralisation into the following three
categories, some of which will be discussed in greater detail elsewhere in this paper:
1. Attention and monitoring: Operators' monitoring behavior depends in part on
the consistency and reliability of automation, both of which have been shown to affect
operator trust in the system.
-Task Inversion: A backup or alerting system, for example an altitude alerting system,
becomes the primary information source for
the operators. Such task inversions usually result in altered operator monitoring behavior.
-Automation Deficit: The temporary and relative reduction in manual performance upon
resuming manual control after extended automated operation, leaving the operator unable
to sufficiently deal with a suddenly increased workload. An example of this is the high
workload levels often encountered below 10,000 ft. by cockpit crews following extended
periods of automated cruise flight.
2. Communication: Research into aircraft accidents has generated many examples that
underscore the importance of effective crew communication.
Flight crews who communicate effectively have been shown to communicate more
frequently, openly, directly, and concisely compared to ineffective crews. However, studies
comparing crews in aircraft with different automation levels show that as level of
automation increases, crew communication decreases.
3. Situational awareness: "the accurate perception of the factors and conditions that affect
an aircraft and its flight crew during a defined period of time" (Satchell, 1993).
-The Big Picture: Although related to situational awareness, the big picture refers to
awareness of the state of the system at a global level. An example of this would be the
China Airlines crew who let their 747 stall and enter a spin while they attended to an
engine problem.
-Raw Data Translation: System automation often translates, interprets, and integrates raw
data prior to presenting it on the pilot/system interface. Although this translation of raw
data can lower mental workload by reducing the amount of raw data received by the pilot,
it also distances the crew from the underlying state of the aircraft. If, for example, a
navigation system peripheralizes the crew to the degree that they no longer attend to the
navigation of the aircraft, a devastating result may occur should the automation fail or be
misdirected by the pilots.
The loss of motor skill as a result of lack of practice is a major concern accompanying
increased automation in the cockpit (Endsley, 1995). Not only is the problem salient and
fairly well studied, but it has been a frequently reported concern of pilots of automated
aircraft (Hughes, 1989; Wiener & Curry, 1980). Further, Moray (1986) has emphasized the
need for operators of automatic systems to have extensive manual practice even though it
will seldom be used in actual operation. Interestingly, however, recent accidents involving
automation issues have not shown manual skill proficiency to be a primary concern, since
accidents have generally occurred when the auto-pilot was flying, or the automation and
pilot were “fighting” for control of the aircraft, thus interfering with each other’s relative
control commands (Aviation Week, 1996). This is not to say that degradation of skill is not
a problem, but rather that other automation factors seem to be more causally related to
aircraft mishaps.
It is ironic, however, that proponents of automation have long argued that much of the
value of automation resides in the fact that pilots, when required, can easily intervene and
pick up where the auto-pilot left off. In reality there is considerable evidence that persistent
monitoring erodes both manual skill and knowledge of the relationships between system
variables (Kessel & Wickens, 1982; Shiff, 1983; Wiener & Curry, 1980). It is common for
co-pilots, when transferring from highly automated wide-
body aircraft to narrow body, less automated aircraft, to need a transition period to revive
their proficiency in manual control skills (Wiener & Curry, 1980). Further complicating
this issue is the fact that with the introduction of highly sophisticated FMSs, complementary
changes in airline procedures discourage manual flight (Billings, 1991). Rather than being
evaluated on their manual flying skills, pilots are judged by their effective use of the
vastly capable and complex “integrated flight path and aircraft management systems” in the
cockpit (Billings, 1991). It should also be noted that several recent incidents, for example
the crash of a USAir 737 in Pittsburgh, although as yet unsolved, may have been related to
a wake turbulence induced unusual attitude which became catastrophic when attitude
recovery was not promptly accomplished.
The introduction of technology into society has created the interesting phenomenon of
reducing small errors of precision, at the cost of occasionally introducing very serious large
errors. Consider the frequently cited example of the digital alarm clock. The introduction
of this device meant that the accepted 10 to 15 minute precision error of the analog alarm
clock was now eliminated (Wickens, 1992). However, this technology meant that the
occasional “set up error” (Wiener & Curry, 1980) could yield an error, although infrequent,
of 12 hours, nearly 48 times the magnitude of the analog alarm clock. The same potential
for occasional catastrophic errors exists in the automated cockpit (Wiener, 1988). Consider
the following example:
An Airbus A320 flew into the ground while on a non-precision approach to
Strasbourg, France. Post-crash analysis determined that the aircraft was
descending at a rate of 3,300 ft/min during its pre-crash descent, far steeper
than the 700 ft/min required by the approach (Aviation Week, 1992). However,
the VOR/DME approach chart for that airport required a 3.3 degree angle of
descent, which is what the pilots most likely intended to input into the Flight
Management System. Rather than entering "3.3" with the FMS in the
"Track/Flight Path Angle" descent mode, the mode in use interpreted the
pilots' entry as a vertical speed, causing the aircraft to descend at 3,300
ft/min rather than the intended 3.3 degree angle of descent. The A320 crashed
short of the airfield in mountainous terrain; recorded cockpit conversations
indicate that the pilots never realized an error had been made.
This example, besides being an example of a "mode error," which will be discussed in detail
below, illustrates the small error/large error trade-off. Although it would be difficult for
human pilots to fly a perfect 3.3 degree angle of descent, it is unlikely that in attempting to
do so, they would err by such a large magnitude. The automated system, however, can fly
the aircraft at a 3.3 degree angle of descent nearly free of error, but must be given the
correct command to do so.
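The magnitude of the discrepancy is easy to verify. At a typical approach ground speed (the
120 knot figure here is an assumption for illustration, not a value from the accident report),
a 3.3 degree path corresponds to a descent rate of roughly

\[
\dot{h} \;\approx\; V_g \tan(3.3^\circ) \;\approx\; \left(120 \times 101.3\ \tfrac{\mathrm{ft/min}}{\mathrm{kt}}\right) \times 0.0577 \;\approx\; 700\ \mathrm{ft/min},
\]

which matches the rate the approach required; the commanded 3,300 ft/min was nearly five
times too steep.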
There are abundant examples of such errors being made. However, most are detected
before the error is elevated to a catastrophic level (Billings, 1991; Wiener, 1988). Wiener
(1988) has suggested four approaches to dealing specifically with this problem. First, he
suggests that systems need to be less cordial to erroneous input at the human interface level,
rather than depending on training and correct operation to alleviate the symptoms of bad
design. Second, he suggests that systems be designed to be less vulnerable to unsafe actions
even in the event of erroneous input (for example Ground Proximity Warning Systems).
Third, systems must be designed with error checking, or a certain “intelligence” capability
to deal with the logic of inputs given other relevant factors (for example, comparing pilot’s
altitude inputs with an internal terrain map). Finally, Wiener (1988) suggests that the
entire system, including Air Traffic Control, be designed to be less tolerant to overall
system error (e.g., ensuring that aircraft follow the exact instructions provided by ATC).
None of the suggestions raised by Wiener (1988) are necessarily easy to implement, but
given the catastrophic outcomes of the “large error” problem in the modern cockpit, many
changes are being made in the direction of these ideals. Although GPWS and TCAS are
now mandatory forms of error checking (Van Cott et al., 1996), improved FMS user
interfaces and terrain checking are in the process of being perfected. Further, this issue
remains an active concern in the human factors field (Van Cott et al., 1996). Although the
assumption is that smart automation would
detect and resolve large errors, ample evidence suggests that this is a complex and difficult
undertaking.
Early auto-pilots were simple devices which could turn to and hold a heading, climb to and
hold an altitude, or track a navigation signal for the purpose of decreasing the need for
continuous hands-on control of the aircraft (Billings, 1991). More important was the fact
that every behavior of the auto-pilot had to be specifically commanded by the pilot;
commands to the auto-pilot were never more than one flight transition away from the
current flight condition (e.g., the pilot could command the auto-pilot to turn to a specific
heading and hold that heading, but could not at that time input a future heading change
command). As automation transitioned into its second generation in the late 1950s
(Billings, 1991), automatic control of the aircraft became gradually more sophisticated,
with devices becoming autonomous from continuous pilot command. Examples of such
devices are the yaw damper, which automatically initiates slight rudder movement to
prevent the “Dutch roll” phenomenon in swept wing aircraft, and “pitch trim
compensators” which control the tendency for aircraft to pitch down at near-supersonic
speeds. Although these devices and others like them increased safety and efficiency, and in
some cases, made high speed transport a reality, they also set the precedent that
autonomous automation could be introduced into the cockpit safely and successfully without
continuous pilot command or oversight.
As automation transitioned into its third generation (Billings, 1991), the objective of
integrating and managing the automatic systems to further reduce workload, increase
safety, and increase efficiency led to FMSs with phenomenal capability. Not only could an
entire flight be preprogrammed into the system, but this execution of the flight could be
undertaken without pilot intervention. Intrinsic to this automation capability was that the
system would have many “modes” available to command the flight. Just as pilots have
several methods to accomplish the same task in an aircraft (e.g., on approach, one can
control altitude with power changes, pitch changes, or approach with the throttles at idle,
and control altitude with pitch and wing spoilers), so too were multiple capabilities built
into FMSs, both for efficiency, and to provide the pilots with greater flexibility. With this
increased ability, however, came a certain need for the automation to deal with peculiar
situations without pilot intervention. This meant that upon reaching certain predetermined
target values or reaching certain “protection limits,” (i.e., the system senses that an unsafe
condition has occurred) the FMS can change its “mode” of operation or over-ride pilot
inputs.
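A toy sketch can make the protection-limit behavior concrete. The mode names, limit value,
and structure below are this author's invention for illustration, not actual FMS logic; the
point is only that the active mode can change without any new pilot input.

    from enum import Enum

    class Mode(Enum):
        FLIGHT_PATH_ANGLE = "FPA"
        VERTICAL_SPEED = "V/S"
        ALPHA_FLOOR = "ALPHA FLOOR"  # an envelope-protection mode

    def active_mode(commanded: Mode, angle_of_attack_deg: float,
                    aoa_limit_deg: float = 15.0) -> Mode:
        """Protection limits take precedence over the pilot-commanded mode."""
        if angle_of_attack_deg >= aoa_limit_deg:
            return Mode.ALPHA_FLOOR  # automation overrides the pilot's selection
        return commanded

    # A pilot who commanded V/S may find the aircraft in ALPHA FLOOR with no
    # new input of their own, the kind of uncommanded transition that makes
    # mode awareness difficult.
    print(active_mode(Mode.VERTICAL_SPEED, angle_of_attack_deg=16.0))

Such uncommanded transitions are precisely what the accident examples below turn on.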
Automation with this level of sophistication has led to two specific pilot interaction
problems (Sarter & Woods, 1991). The first problem is that, given the inherent complexity
of the system, greater demands are placed on the pilots to understand the multiple
ramifications of each FMS “mode.” Because a particular mode may behave differently
under different circumstances (e.g., at different altitudes), the pilot must understand in
advance what the FMS will do given certain inputs, and remember what each FMS mode
abbreviation means. Consider the following example:
An Airbus A320 flew into the ground while on short approach to Bangalore
Airport in India. The pilots had inadvertently set the auto-pilot to "Idle Open
Descent" mode, which sets the auto throttle to idle, rather than one of the two
descent modes in which auto throttles are active (Aviation Week, 1990). With the
throttles at idle, the airspeed decayed well below the desired airspeed of 132
knots, since altitude was maintained by
pitch rather than thrust. By the time the pilots realized their error, they were too
slow and too low to recover, and crashed short of the runway killing 94 of the 146
people on board.
The second FMS complexity problem is that the behavior of the automation is contingent
upon certain “situational” factors in addition to pilot inputs, often making it difficult for the
pilots to predict the behavior of the auto-pilot either upon engaging the auto-pilot, or in
monitoring its behavior as it progresses along the flight (Sarter & Woods, 1992). Consider
the following example:
An Airbus A300-600 stalled 1800 feet above the ground on approach to Nagoya
Airport, Japan, following a chaotic battle for control of the aircraft between the
pilots and the auto-pilot. While flying the aircraft manually with flight director
guidance and auto-throttles engaged, the co-pilot inadvertently engaged the TOGA
(take off/go around) lever on the throttle quadrant. Realizing the error, the pilots
attempted to continue the approach. To correct for the now off-glideslope condition,
they engaged auto-pilots 1 and 2, believing that the auto-pilot would return them to
the desired flight path.
Instead, the auto-pilot resumed the TOGA mode which had accidentally been
selected by the co-pilot previously. Realizing this, the pilots applied forward
pressure on the yoke to correct for the auto-pilot induced 18 degree nose up
condition. However, because the FMS software inhibits automatic "yoke force auto-
pilot disengagement" below 1500 ft, the auto-pilot remained engaged and initiated
nose-up stabilizer trim against the pilots' inputs. Although the pilots pushed down
with all their strength, the trim system continued to push the nose upward for
twenty seconds until the pilots manually disengaged the auto-pilot.
Several seconds later, the extreme nose up condition and deteriorating airspeed
unexpectedly caused the “alpha floor” protection mode to engage due to excessive
angle of attack. This “alpha floor” condition commanded a thrust increase inducing
an even greater nose-up attitude. Although the captain promptly disengaged the
“alpha floor,” the aircraft was far out of trim, the airspeed was at 78kts, and the
altitude 1800 ft. The aircraft stalled and could not be recovered before it hit the
ground, killing 264 people. (Aviation Week and Space Technology, 1996)
Although this incident seems obscure and hardly believable, very similar incidents also
occurred in 1985, 1989, and 1991 (Aviation Week, 1996), and highlight the dangers of
multi-modal automation. As automated systems become more sophisticated, they in fact
begin to fly more like humans (albeit more precisely), using a
complex combination of methods to achieve their goal. However, as this occurs, it makes it
more difficult for those pilots monitoring the automation to predict and, in fact, understand
what the automation is doing. Given the flexibility of the FMS and the “dynamism of flight
path control,” serious cognitive demands are placed on the pilots (Sarter & Woods, 1992).
Not only must they decide the level and mode of automatic control, but they must diligently
monitor the automation to ensure that it behaves as intended.
Sarter and Woods (1992, 1994), while seeking empirical evidence for pilots’ anecdotal
suggestions of confusion about the Flight Management System operation, found converging
and complementary data demonstrating both serious gaps in pilots’ understanding of the
system logic and difficulty in tracking the behavior of the FMS while in flight.
Surprisingly, Sarter and Woods (1992) found that 55% of Boeing 757 pilots surveyed
agreed with the statement “In B-757 automation, there are still things that happen that
surprise me." Further, 20% of the pilots agreed with the statement: "There are still modes
and features of the B-757 FMS that I don't understand."
In a follow-up study, Sarter and Woods (1994) created an FMS command-laden
experimental scenario which was then flown in a part-task simulator designed to teach FMS
operations. The goal of this study was to observe pilots using the FMS in a simulated
mission to understand pilots’ mental representations of the FMS logic. The results showed
that the majority of pilots had little difficulty with routine operations ranging from
establishing a holding pattern to setting up for an ILS approach. However, they found that
70% of pilots showed deficiencies in one or more of the following less standard procedures:
2. anticipating mode indications on the ADI display throughout the take-off roll,
7. describing the system behavior differences above and below 1500 ft for a loss of radio
“signal” condition.
An example deficiency was that 80% of pilots did not realize that aborting an auto-throttle
take-off required the pilot to manually disconnect (as opposed to an automatic disconnect)
the auto-throttles in order to prevent them from re-accelerating after manual intervention.
The authors (Sarter & Woods, 1994) attribute these deficiencies to two separate factors.
They see the first three deficiencies related to weak mode awareness, both in terms of
dealing with an FMS related failure and with anticipating system status and behavior. The
second factor, raised by the last four deficiencies, points to an impoverished knowledge of
the “functional structure” of the FMS (Sarter & Woods, 1994). It is quite obvious from
these findings (Sarter & Woods, 1992, 1994) and others (Wiener, 1989) that even
experienced pilots have trouble with the complexity of the FMS. Sarter and Woods (1992)
suggest that one of the primary problems with FMS systems is the poor feedback given to
pilots about the behavior of the FMS, exacerbating the already difficult task of predicting
system behavior. In fact, both accidents and empirical investigations have led to design
and training changes, yet findings persist in revealing that pilots have trouble
understanding and predicting FMS behavior, particularly in non-routine situations.
It has also been suggested that any system which is "multi-modal" in nature is difficult for
human operators (Norman, 1988; Wiener, 1989), and thus problematic regardless of
interface and training issues. Further, the problem remains that even if a crew has complete
understanding of the FMS system, 84% of FMS related reports to ASRS indicate that
"programming errors" still present the highest incidence of problems. The danger of such
errors is compounded by the fact that pilots have trouble predicting the behavior of the
FMS. Clearly, if some element of "wait and see" is built into pilots' monitoring strategies,
the detection of programming errors will be delayed. Consider the following example:
A Boeing 757 crashed into San Jose Mountain on approach to Cali, Colombia, while
descending at night; the crew never saw the mountain. A post-crash analysis of flight
data and cockpit recordings determined
that the pilots of the aircraft entered a command into the FMS to fly direct to Tulua
VOR in order to comply with an ATC request to report their position once over the
VOR. However, the pilots failed to realize that they had already passed Tulua VOR, so
their command caused the FMS to turn the aircraft back in the direction from which
they had come. While the behavior of the aircraft surprised the pilots, they continued
to let the FMS turn the aircraft in the wrong direction for approximately 90 seconds.
With suspicion growing, the pilots switched the auto-pilot to the “heading select” mode
in an attempt to return the aircraft heading toward Cali. However, the 90 second turn
to the left, and then the corrective turn to the right, placed the aircraft off course and in
mountainous terrain. The pilots initiated an emergency climb after prompting by the
GPWS, but the 757 hit the top of a 12,000 ft mountain,
killing 164 of the 167 individuals on board. (Aviation Week and Space Technology,
1996)
Only as evidence builds that the autopilot behavior is deviating from expectation will the
pilots begin to suspect a programming error. Although industry has advocated better
training and researchers have advocated better FMS interface and cockpit display design, it
seems likely that mode confusion will persist as long as FMS operation is optimized for
efficiency rather than for the pilot's understanding.
As suggested earlier, the clear relationship between high workload and increased
probability for human error has been a strong force in the push toward cockpit automation.
Early systems automation was successful in reducing the workload for pilots (Billings, 1991)
which was welcomed given that pilots “prefer to be relieved of much of the routine manual
control and mental computation in order to have time to supervise the flight more
effectively and to perform optimally in an emergency” (Wiener, 1988). Further, airlines
have long desired wide-body aircraft requiring only two crew members, and the reduction of
workload through automation made certification of such operations possible.
Defining workload and then measuring it has always been a difficult task for engineering
psychologists (Wiener, 1985), yet aircraft designers were quite confident that heightened
levels of automation would reduce workload to a large degree (Wiener, 1988). Two
interesting factors have arisen since highly automated aircraft were certified for two pilot
operation on the grounds that workload had been sufficiently reduced. First, it seems quite
evident from pilot studies that, while manual workload may have been effectively reduced,
mental workload was not reduced and may have actually increased. This is because the
nature of the pilot's task shifted from manual control to cognitive management of the
automation.
Wiener (1988) suggests that automation now calls for more programming, planning,
sequencing, and alternative selection, all of which add up to considerable levels of cognitive
processing. In fact, a study by Curry (1984) of 100 Boeing 767 pilots found that only 47%
agreed with the statement, “automation reduces overall workload.” Responding to another
question, 53% of the pilots agreed with the statement, “Automation does not reduce
workload, since there is more to monitor now.” Although subjective in nature, it is clear
that many pilots find the automation management task quite demanding and perhaps more
demanding than the highly manual flying task which the automation replaced (Kantowitz
& Casper, 1988). Not only does this bring into question the validity of the certification
findings, but it implies that if the relationship between workload and error still exists in the
automated cockpit, automation may now be affording new opportunities for human error
rather than eliminating them.
Another factor related to automation induced workload is the temporal spacing of workload
throughout the different phases of the flight. Automation has reduced the workload in
cruise phases of flight to almost nothing (Billings, 1991). However, workload tends to
increase dramatically upon entering the “terminal” area because of two factors. First, since
terminal area flight usually requires some combination of directional, altitude, and speed
changes, not to mention potential FMS mode changes, the monitoring task becomes much
more involved. The pilots must monitor the aircraft’s behavior in an attempt to stay ahead
of the FMS, and they must also monitor the FMS commands to ensure that the information
programmed into the FMS at the beginning of the flight is correct. The second factor in
increased terminal area workload is the commonly cited mismatch between Air Traffic
Control procedures and the FMS (Wiener, 1985). Assuming that ATC requires deviation
from a standard approach, which is often the case, the pilots must spend “heads down” time
reprogramming the FMS, while maintaining constant communication with ATC and
scanning for other aircraft. In the future ATC may communicate directly with an aircraft’s
FMS, thus reducing both communication and programming errors and allowing the pilots
greater opportunity for scanning for other aircraft, but this feature is still several years
away.
Automation-induced complacency has received growing research attention (Parasuraman,
Molloy, & Singh, 1993; Thackray & Touchstone, 1989). In addition, the term complacency
has been used to describe inadequate cockpit performance even prior to highly automated
aircraft. Complacency has been characterized as a psychological state marked "by a low
index of suspicion," while the ASRS coding manual defines complacency as "self-
satisfaction which may result in non-vigilance based on an unjustified assumption of
satisfactory system state" (Parasuraman et al., 1993). Singh, Molloy, and Parasuraman
developed a rating scale to measure complacency potential, conceptualized as
one's attitude toward automation coexistent with other factors. Singh et al. (1992) found
four independent factors revealing a potential for complacency, those being confidence,
reliance, trust, and safety.
Overconfidence in automation may not, however, be a strong enough factor itself to cause
complacency. Although Thackray and Touchstone (1989) attempted to induce the effects
of complacency by having a reliable automated Air Traffic Control task fail both in the
beginning and end of a two hour experimental session, they failed to show a reliable
performance difference between the two failures. Further, their research did not yield a
difference in detection efficiency between the group with automated assistance and the
group who performed the task without assistance. Thackray and Touchstone (1989)
reasoned that their failure to find a difference may have been due to the short session, or
perhaps because the subjects performed only a monitoring task, with no other tasks
competing for resources. Parasuraman et al. (1993) reasoned that the effects of automation-
induced complacency are more likely when the operator is responsible for many functions,
as is often the case in aircraft incidents in which complacency was a factor. Singh et al.
(1993) reasoned that complacent behavior exists only when both a complacency potential
exists on the part of the pilots, and certain other factors coexist. Those factors include pilot
inexperience, fatigue, high workload, and poor communication (Singh, Molloy, &
Parasuraman, 1993).
Based on the reasoning that high workload may cause automation induced complacency,
Parasuraman et al. (1993) had subjects detect failures of an automated system monitoring
device while those subjects controlled a fuel management system and a tracking task.
Automation reliability was manipulated with groups either seeing high or low constant
reliability or variable reliability automation alternating from high to low every ten minutes.
Additionally, subjects were placed in a “monitor only” group or were in a group which
monitored and controlled all tasks. Results clearly showed that detection of automation
failures was worse for subjects in the constant reliability condition. Results also showed
that subjects whose only task was to monitor showed no performance differences due to
automation reliability. This finding supported earlier findings (Thackray & Touchstone,
1989) that workload must reach a certain level before complacency-related performance
deficits will be seen. The authors viewed these results as the first evidence that automation
induced complacency could be produced by high workload and highly reliable automation.
These findings are significant in the operational setting because workload can be very high
at certain times and the automation extremely reliable. The problem however, is that the
automation is not perfectly reliable. As discussed earlier in this paper, pilots often enter
incorrect information into the FMS which then diligently carries out exactly what it is
commanded to do. In addition, the automation is capable of failure even when the correct
information is entered into the system. Consider the following example, described to this
author:
While in cruise over the Mediterranean en route from London to Cairo, the pilots
of a Boeing 767 monitored as the FMS flew the aircraft. Unbeknownst to the
pilots, the auto-throttle was gradually but erroneously reducing the thrust being
supplied from the engines. While this was occurring, the auto-pilot was gradually
raising the aircraft's nose to maintain the altitude specified by the FMS. Because of
the moderate rate of thrust
reduction and smoothness with which the auto-pilot responded, the pilots failed to
sense the cues normally associated with changes in pitch. Fortunately, the captain
eventually noticed that the airspeed was unusually low, and manually accelerated
the throttles. However, by the time the anomaly was noticed, the airspeed had
dropped 25kts below the appropriate cruise speed, and only 15kts above stall
speed.
This example demonstrates that even highly reliable automation can and does fail in
operational settings. Not only does automation simply fail to perform correctly, but such
failures can go undetected by complacent monitors.
Situational Awareness
As research on the "out of the loop" performance problem has matured (Endsley,
1995), new terminology, research methods, and constructs have evolved to suit this research
area. Of these, the concept of “situational awareness” has evolved as a means of both
conceptualizing the problem, and, in fact, measuring it. The use of situational awareness as
a causal agent is strongly supported by some (Endsley, 1995) or used only as a label for a
variety of cognitive processing activities by others (Sarter & Woods, 1995). It is viewed by
some as a "buzzword of the '90s," rather than an effective research paradigm (Wiener,
1993), and its utility as a construct has been questioned (Flach, 1995). Because of the effort
dedicated to this research paradigm in both civilian
and military settings, situational awareness will be treated as a valid construct for the
purposes of this paper, and the discussion will focus on how it has been used to
conceptualize and measure the "out of the loop" performance problem.
Although there have been numerous definitions proposed for situational awareness, most
have not been applicable across different task domains (Endsley, 1988). However, the
definition settled on by the most prolific researcher in this area is as follows (Endsley,
1995): “Situational awareness is the perception of the elements in the environment within a
volume of time and space, the comprehension of their meaning, and the projection of their
status in the near future." Further, Endsley (1995) has divided situational awareness into
three levels. Level 1 situational awareness is the perception of the task-specific
elements in the environment. Those task-specific elements include the status, attributes,
and dynamics of the environment which are specifically pertinent to effective performance.
Level 2 situational awareness is the comprehension of the situation based on the synthesis
of disjointed Level 1 elements. Most important, however, is the fact that this level of
awareness involves interpreting the Level 1 elements in light of the operator's goals,
providing a holistic picture of the environment to the operator.
Level 3 situational awareness is the ability of the operator to project the future actions of the
elements in the environment based on Level 2 situational awareness. The highest level of
situational awareness “is achieved through knowledge of the status and dynamics of the
elements and comprehension of the situation” (Endsley, 1995).
Although relatively little research has been conducted using situational awareness as the
dependent variable, Endsley and Kiris (1995) used an expert-system-aided navigation task
to study the effects of differing levels of automation on workload and situational awareness.
Using five levels of automation, the authors hypothesized that both workload and
situational awareness would decline as the level of automation increased. Measuring (a)
decision time upon automation (expert system) failure, (b) decision selection, (c) decision
confidence, (d) workload, and (e) situational awareness, the authors found that as the
automation level went up, decision time following an automation failure also went up.
Further, situational awareness went down as the level of automation went up; in particular,
the comprehension component of
awareness was affected by automation, leading the authors to speculate that subjects who
relied on automation may not have developed a higher level of understanding of the
situation. Workload, however, did not decline reliably with automation level, consistent with
research and anecdotal findings (Billings, 1991) that automation does not necessarily
correlate with reduced workload. Surprisingly, higher confidence levels corresponded with
higher levels of automation, even though their decision times were longer and situational
awareness lower.
Whether or not one supports the use of situational awareness as a theoretical construct, or
as merely a general descriptive concept, there is little doubt that the research that has been
done has successfully captured some difference in operator knowledge based on automation
level. Further, in terms of conceptualizing and communicating the nature of the “out of the
loop” performance problem, research in this area has been beneficial. Most importantly,
however, if one looks at this research as part of a body of research which has attempted to
measure operator performance in terms of level of automation, the findings are generally
consistent with other research in the field demonstrating reduced operator efficiency when
placed “out of the loop” (Johannsen, Pfendler, & Stein, 1976; Kessel & Wickens, 1982;
Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969).
Mental Models
The concept of the “mental model” as an explanatory device for human cognition is not a
new one, nor is interest in mental models exclusive to cognitive and engineering
psychology (Wilson & Rutherford, 1989). In fact, mental models have been used as an
explanatory construct in manual control literature for over thirty years (Rouse & Morris,
1986). This body of literature commonly used the phrase “internal model” to describe the
“images” that individuals use to organize and execute daily procedural activities or to
operate complex devices (Jagacinski & Miller, 1978). While the originator of the mental
model notion is likely Kenneth Craik (1943), Johnson-Laird (1983) instantiated and
popularized the notion of the mental model (and in fact a more sophisticated and formal
version of it), leading to the
embrace of this concept by cognitive psychology in the early eighties (Rouse & Morris,
1986). Interestingly, however, while the manual control literature viewed this concept
as generally self evident (Rouse & Morris, 1986), and therefore a suitable assumption
requiring little direct examination, cognitive psychology focused
more directly on the “mental model” as a phenomenon (Rouse & Morris, 1986), even
attempting to specify its underlying cognitive processes (Wilson & Rutherford, 1989).
Norman (1988) explained people’s interactions with devices by distinguishing the
conceptual model,
characterized as the appropriate model which a system designer desires the operator to
have, versus the mental model, which is what the operator actually develops through device
interaction.
33
Even though the use of the concept of a mental model is fairly common in the literature, it
has suffered from a lack of explicit definition (Rouse & Morris, 1986). Johnson-Laird
(1981) stated: “A [mental] model represents a state of affairs and accordingly its structure
plays a direct representational or analogical role. Its structure mirrors the relevant
aspects of the corresponding state of affairs in the world.” Rouse and Morris (1985) have
defined a mental model as: “mechanisms whereby humans are able to generate descriptions
of system purpose and form, explanations of system functioning and observed system states,
and prediction of future states.” Carroll and Olson (1987) have defined mental models as
“a rich and elaborate structure, reflecting the user’s understanding of what the system
contains, how it works, and why it works that way. It can be conceived as knowledge about
the system sufficient to permit the user to mentally try out actions before choosing one to
execute.” Borgman (1986) summarizes the perspective of research in the human-computer
interaction domain, describing the mental model as a
cognitive mechanism for representing and making inferences about a system or problem
which the user builds as he or she interacts with and learns about the system. The mental
model represents the structure and internal relationship of the system and aids the user in
understanding it, making inferences about it, and predicting the system’s behavior in future
states. Common to these accounts is the view that a well developed mental model provides:
(a) knowledge of relevant system
elements that directs attention and classifies information in the perceptual process, and (b) a
mechanism for projecting future states of the system based on its current state.
Regardless of the specific author, however, most definitions contend that a mental model is
a form of subjective representation of external reality, and allows implicit use of the model
for the purpose of “thinking” about the system. This renders mental models functional,
affording the user some explicit, though limited, ability to consciously run the model.
Equally important, however, is the notion that a user’s mental model is seldom a perfect
analogy to the real system, and is “surprisingly meager, imprecisely specified, and full of
inconsistencies, gaps, and idiosyncratic quirks,” and quite often possesses blatant errors.

The purpose of this discussion, however, is not to review the relative merits and theories of
mental models, but rather to discuss the way in which the general conceptualization of
mental models is useful in understanding the “out of the loop” performance problem in
highly automated aircraft. Aviation presents itself as a unique domain for the study of
mental models for two primary reasons. First, nearly all of its operators, especially those
in the commercial domain, can generally be considered domain experts.
Further, not only are its participants highly trained and versed in aviation related concepts,
but all must perform a nearly identical task. This is not to say that all pilots have identical
mental models, or that their models are a perfectly balanced representation of the real
system. However, as a population of experts they most certainly have very similar models
of the system, and their models are by necessity fairly accurate representations.
The second factor that makes aviation unique is the high level of complexity which must be
part of the flight task mental model. Not only must the pilot’s model include the traditional
manual controlling model in order to fly the aircraft, but the pilot must also have the
aircraft systems, airspace system, air traffic control system, communication, navigation, and
most importantly, the current dynamic state of the aircraft in relation to all the other
systems as part of that model. This notion is not unlike the perspective held by Williams,
Hollan, and Stevens (1983) that mental models are composed of autonomous objects with
an associated topology; an autonomous object being a mental object with an explicit
behavior of its own and explicit topological connections to other objects. In addition, I
propose that there must be two
levels of the same model: a static, schema-like model of the system, and a real-time,
dynamic execution of that model.
The static model is much like any operator’s model of a particular device, allowing the pilot
to clearly describe the operations of all the systems and the relationships between those
systems. When flying, however, the static model is the basis for the activation of the
dynamic execution. The dynamic execution is, in essence, the activation of the static model
with variable data entered into hypothetical “slots.” The activation of this model, however,
is not uniform, but rather a system with varying levels of activation in which components of
the model that are required for efficient task completion are most activated, and those less
relevant to the task less activated. As the operator
accomplishes the task, those areas of the static model which have become activated remain
that way for some time even when the task no longer supports the activation of the model.
The areas of activation provide the pilot with quick and easy access to those areas, and
benefit the pilot through more efficient cognitive and perceptual processing of features
relevant to the task.

For example, a pilot, while on the ground, can explain the relationship between pitch,
power, altitude and airspeed. While flying, and especially while initiating a descent, the
pilot must use the information in this model, in combination with elements of the present
dynamic environment (i.e., current airspeed, throttle setting, pitch and altitude) in order to
execute the descent properly. I contend that during the execution of a task, in this case the
execution of a descent, the relevant portions of the static mental model and all their
associated elements become activated. Not only does activation allow for proper execution
of the task but, according to Endsley’s (1995) description of a well developed mental model,
“the model will provide (a) for the dynamic direction of attention to critical cues, (b)
expectations regarding future states of the environment (including what to expect as well as
what not to expect) based on the projection mechanisms of the model, and (c) a direct,
single-step link between recognized situation classifications and typical actions.”
The unique feature of the commercial pilot, however, is that he has a well developed mental
model of the flying task, yet there are frequent examples (this paper and Billings, 1991) of
pilots failing to perceive and integrate information as would be expected given the quality
of their mental model (and its supposed level of activation). Most importantly, Endsley
(1995) points out that a mental model should provide for the dynamic direction of attention
to critical cues. Yet it often seems that pilots fail to attend to critical and sometimes life
threatening cues which should be perfectly salient. This dynamic execution theory of
mental models predicts that any weakening of activation would hinder the operator’s ability
to perceive critical elements in the environment, and would thus lead to conditions in which
critical cues are not perceived and integrated into useful information.
The proposed theory is somewhat similar in vein and at least tangentially related to
Neisser’s perceptual cycle (Neisser, 1976) which, as put forth by Adams, Tenney, and Pew
(1995), views perceptual acuity and efficiency as a function of cognitive structures available at the
time of perception. Neisser states, “Because we can see only what we know how to look for,
it is these schemata (together with the information actually available) that determine what
will be perceived. . . . At each moment the perceiver is constructing anticipations of certain
kinds of information, that enable him to accept it as it becomes available” (p. 20). The cyclic nature
of Neisser’s theory implies that each perceptual event results in a modification of the
schema which then “directs further exploration and becomes ready for more information.”
Neisser’s theory suggests that effective perceptual activity is contingent upon the quality
and nature of the previous perceptual cycle. If the activity undertaken by an individual is
different from the activity suggested by the operator’s primary goal, then the perceptual
cycle which proceeds may be ineffective for guiding perceptual activity. According to the
dynamic execution theory, a mental model which remains fairly static (e.g., when the
operator inactively monitors for long durations) will likely lead to a perceptual system
unprepared for the consumption of critical information, or perhaps prepared for the wrong
information. Neisser’s perceptual cycle (1976) also suggests that as the operator’s task
shifts there should be a transitory period during which an inadequate perceptual cycle must
be modified before perception again becomes effective.
The next section will review previous research from controlled empirical studies which
examined human monitoring behavior in manual and automated systems, some of which
allude directly to the notion of mental models or similar concepts. In fact, Endsley and
Kiris (1994) suggest that some forms of manual control may lead to “maintenance” of an
operator’s mental model. While certainly related, such suggestions are problematic given
the quality of the mental model possessed by experienced pilots. Further, although there is
an implication in some studies that manual control improves cue sensitivity (Johannsen,
1976; Kessel & Wickens, 1982; Wickens & Kessel, 1979-1981; G. Young, 1995; Young,
1969), the mechanism underlying this improvement has not been specified.

I hope not only to shed some new light on the results of past research using the dynamic
execution theory, but also to present new experimental results which further support the
hypothesis that effective sensitivity to key elements in a dynamic process
environment, and correct integration of and response to those elements, is contingent upon
an adequately activated mental model of the system.
Relevant Research
Given the proliferation of automation in modern cockpits, and the anecdotal and theoretical
support for the view that automation in cockpits should be approached cautiously (Billings,
1991), there is surprisingly little controlled, empirical research dealing with this issue.
Most of the research comparing monitors and controllers in automated, dynamic tasks has
employed tracking or flight control tasks with simulated flight dynamic shifts implicating
control system failure (Johannsen et al., 1976; Kessel & Wickens, 1982; Wickens & Kessel,
1979-1981; G. Young, 1995; Young, 1969) or actual flight tasks with a failure of the
automated system (Ephrath & Curry, 1977). Others have used cognitively oriented decision
making tasks (Endsley, 1995; Parasuraman, Molloy, & Singh, 1993; Thackray &
Touchstone, 1989). Findings from these experiments have generally demonstrated superior
failure detection performance for active controllers
(Endsley, 1995; Parasuraman et al., 1993; Johannsen, Pfendler, & Stein, 1976; Kessel &
Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969) although some
have found either no difference (Thackray & Touchstone, 1989) or a failure detection
advantage for monitors (Ephrath & Curry, 1977). These problematic research findings have
been attributed to a task which was unnecessarily biased for the system monitors (Young,
1995), having an experimental paradigm in which workload was too low (Parasuraman,
Molloy, & Singh, 1993) or experimental trials that were too short in duration (Thackray &
Touchstone, 1989) to reveal performance differences between controllers and
monitors.
The methodological approach used in the present research is based on studies which found
superior failure detection performance for manual controllers on a tracking task (Kessel &
Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969). More
importantly, those subjects who controlled manually were also better at detecting failures
when they then transferred to the monitoring task. The transfer effects found in this
research offer the strongest evidence that differences between monitors and past-controllers
may be related to differences in subjects’ mental models of dynamic systems. The next
section will therefore focus primarily on related experiments in which monitors and
controllers were compared in a transfer condition. These findings are both theoretically
and operationally more significant and are the basis for the current research.
Paradigm History
Young’s (1969) single-axis tracking task, which found superior performance for controllers,
was improved and expanded by Wickens and Kessel (1979) who designed a similar
experiment using a two dimensional pursuit tracking task to increase task complexity.
They also addressed concerns that Young’s (1969) auto-pilot methods may have been
unrealistic, making the detection of failures easier for monitors. Their results demonstrated
that as subjects switched from monitoring to controlling, detection
accuracy increased slightly. Wickens and Kessel (1979) determined that the superior
performance of controllers depended on proprioceptive cues available
in the first few seconds after the onset of a failure. Because of the short, transient nature of
a proprioceptive standard, detection must occur within the first seconds, otherwise subjects
resorted to the visual channel for failure information. This finding further strengthened
their argument that the controller advantage was a result of proprioceptive feedback.
The authors also suggested that superior performance may have been due, in part, to a more
consistent conceptual model of the system; a more consistent internal model of the
system should enhance the subject's ability to detect deviations from a normal state. This
was based on the view that a conceptual model of greater consistency developed as a result
of the controller's ability to differentiate between one’s own inputs and those acting upon
the system externally (e.g., turbulence). In addition to having an internal model of greater
consistency, it was believed that controllers in a system could test hypotheses about the
general state of the dynamic system through subtle system inputs, reinforcing and testing
that model.

In order to determine the role of workload in failure detection performance, Wickens and
Kessel (1979) employed a secondary task in their experiment. As the side task was added
and its difficulty increased, no marked decrease occurred in the detection performance of
either group as a function of
workload. Instead, higher levels of workload shifted the speed-accuracy bias toward speed
at the expense of accuracy.
Wickens and Kessel’s (1979) finding raised the question of why the increased workload of
the manual tracking task did not have a negative impact on failure detection performance
like that found by Ephrath and Curry (1977). Wickens and Kessel (1980) pointed out that
the manual tracking task and the failure detection task may not be competing for the same
resources, as had been previously believed. Moreover, the resources
allocated to the tracking task were those that allowed subjects to utilize proprioceptive
feedback in the detection process. This suggested that these operations work in
cooperation, rather than in competition, with each other. Using the same experimental
paradigm to examine resource allocation, Wickens and Kessel (1980) concluded that
controlling and monitoring actually
rely on different processing resources to detect failures. Failure detection while monitoring
draws on the same perceptual resources used to perform the monitoring
task. However, while controlling, subjects rely on a response-related reservoir separate from
those perceptual resources.

The previously mentioned studies (Wickens & Kessel, 1979; Young, 1969) employed repeated measures
designs that had subjects perform both monitoring and controlling versions of the failure
detection task. If subjects developed a superior conceptual
representation while controlling, this advantage would have been available to them in
both participatory modes.

Wickens and Kessel (1979) hypothesized that concurrent development of both a controlling
and a monitoring conceptual model negatively affected the performance of controllers. This
was based on evidence suggesting that visual information caused a reduced sensitivity to
proprioceptive information, especially when the two sources contradicted each other
(Posner, Nissen, & Klein, 1979). Therefore, because of a strictly visual-cue based model
developed while monitoring, subjects may have had the tendency to rely on faulty visual
cues while controlling. This bias toward visual cues when the two information sources were
in conflict therefore negatively affected the performance of controllers. Of course, such
cross-contamination between participatory modes could not be ruled out within a repeated
measures design.
Kessel and Wickens (1982) isolated the impact of subjects' conceptual representations on
failure detection performance by employing a between-subjects, transfer of training
design. In this study, three groups of subjects were used: the first group transferred from
controlling to monitoring, the second group transferred from monitoring to controlling, and
the third group monitored in both sessions. Consistent with expectations, monitors took
longer to respond to system failures and made more errors than controllers. Further, the
magnitude of this difference was approximately five times that found in the previous
repeated measures designs, thus
confirming the view that the monitor/controller conceptual model bias had been operating
in those designs. Most important,
however, was the significant increase in the performance of subjects during monitoring who
had controlled during the first session (Kessel & Wickens, 1982). This result indicated that
controlling not only led to the development of a conceptual model that aided in detecting
failures, but that the model was powerful enough to affect performance on a task which no
longer supported the features of that particular conceptual model. From the standpoint of
dynamic process control, this finding suggests that many of the benefits of automation can
be utilized while allowing the operator, through proper training, to maintain a conceptual
model optimal for detection of subtle changes in system performance (Young, 1995).
Kessel and Wickens’ (1982) transfer of training design was replicated by Young (1995),
who improved on the design by implementing a yoking procedure that insured identical
visual stimuli for both controllers and monitors, thus eliminating auto-pilot induced biases.
Further, Young (1995) addressed concerns that Kessel and Wickens’ (1982) transfer effects
may have been attributable to simple vigilance factors and not conceptual model differences
by creating a condition with a high rate of failures and a very short trial length (80 failures
in just over six minutes). If the earlier studies’ results represented merely vigilance-related
effects, then a very short experiment with a high rate of failures would not be expected to
reproduce them.
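Although neither study describes its software, the yoking logic itself is straightforward. The
sketch below (in Python, with hypothetical structures and function names) records every
display frame generated during a controller’s trial and replays the identical stream to a
yoked monitor:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Frame:
        """One display update: everything a subject sees on a given tick."""
        time_ms: int
        cursor_pos: float   # system output drawn on the screen
        target_pos: float   # command input drawn on the screen

    def record_controller_trial(sim_frames) -> List[Frame]:
        """Store every frame produced while a controller performs the trial."""
        return [Frame(t, cursor, target) for (t, cursor, target) in sim_frames]

    def replay_for_yoked_monitor(tape: List[Frame],
                                 draw: Callable[[Frame], None]) -> None:
        """Present the identical visual stream to a yoked monitor."""
        for frame in tape:
            draw(frame)  # the monitor sees exactly what the controller saw

The essential design property is that the monitor’s visual input is a pure replay of the
recording, so any detection difference must originate in the observer rather than in the
stimulus.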
Young (1995) successfully replicated Kessel and Wickens’ (1982) results, showing that
when active controllers are transferred to the monitoring task they are better at detecting
failures than subjects who only monitored. This was additional evidence that features of the
controlling task transfer to the monitoring condition, and Young’s (1995) yoking
methodology insured that both controllers and monitors, when compared directly, received
identical visual stimuli. Young (1995) also found a nearly identical pattern of results when
the experiment was reduced in length from 45 minutes to just over six minutes. This
finding further supported the hypothesis that the improved failure detection performance
was due to an improved conceptual model guiding focus to relevant visual cues.
Present Research
Taken together, the results of Kessel and Wickens (1982) and Young (1995) strongly
suggest that individuals who control a simple dynamic system have an advantage in
detecting failures of that system when monitoring compared to individuals who only
monitor. Further, this research suggests that controllers develop a conceptual model of the
system which makes them more sensitive to subtle cues implicating system failure.
Although these findings are significant, they are limited in scope given the largely psycho-
motor nature of the tracking task employed. Although controllers may in fact have a more
effective “conceptual model” of the system, this model bears little resemblance to the
complex mental models required in operational settings.
Although the pilot of an aircraft, for example, may have a motor schema for manual control
of the aircraft, this is but one component of a mental model of far greater complexity. An
operator of a two dimensional tracking task has essentially one display to guide his control,
yet the aircraft pilot has multiple displays to track, not to mention six degrees of freedom
rather than two, and out-of-cockpit, tactile, and aural information to guide his control.

The first objective of the present research is therefore to
replicate the findings of Kessel and Wickens (1982) using a more complex, non-psycho-
motor, aviation-like dynamic task. This experiment not only seeks to replicate the original
finding that controllers show better monitoring performance, but also to validate this
paradigm as an improved experimental platform for exploring the idea that a better
conceptual model of the system underlies the controller advantage.

The primary objective in the design of the experimental paradigm was to create a task
in which failures must be detected inferentially. In such
tasks, the monitor collects data from the display, each sample being regarded as a
piece of evidence about the state of the system. As a
monitor of this system, the operator entertains a “rolling null hypothesis” that system
parameters have not changed, but responds when some change in the parameters has been
detected.
Although the particular task is generated from aviation type components, the combination
of these tasks is synthetic, and simple enough so that an individual can acquire the basic
principles and operational requirements in a half-hour of training. The task is, however,
highly analogous to many forms of dynamic process control where a failure of some sort is
not reflected in a single value, but rather in an apparent shift in the population mean
(Wiener, 1984) and thus inferential in nature. The task was designed so that failure
detection requires a synthesis of several features of the task, making detection from a
single cue alone impossible.
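The “rolling null hypothesis” can be made concrete with a simple one-sided accumulator in
the spirit of CUSUM change detection. The sketch below is my own illustration, not the
mechanism implemented in the experiment; the slack and criterion parameters are
arbitrary:

    def rolling_null_monitor(samples, expected_mean, slack=0.5, criterion=5.0):
        """Flag an apparent shift in the sampled population mean.

        Each observation is evidence against the null hypothesis that
        system parameters have not changed; deviations within `slack`
        are forgiven, larger ones accumulate until `criterion` is crossed.
        """
        evidence = 0.0
        for i, x in enumerate(samples):
            evidence = max(0.0, evidence + abs(x - expected_mean) - slack)
            if evidence > criterion:
                return i  # sample index at which a failure is declared
        return None  # null hypothesis retained for the whole trial

    # Example: the mean shifts from 10 to about 12 halfway through.
    readings = [10.1, 9.8, 10.2, 10.0, 12.1, 12.3, 11.9, 12.2, 12.0]
    print(rolling_null_monitor(readings, expected_mean=10.0))  # -> 7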
Based on the view that aircraft pilots have a reasonably complex mental model of the flying
task and that numerous subtleties are built into this model (e.g., the sensory stimuli one has
while initiating a descent), every effort was made to include operational subtleties as part of
the system. These subtleties would, at least in theory, become part of the operator’s mental
model. Further, a mastery of these subtleties would enhance one’s ability to infer a failure
since system subtleties initially have the effect of masking actual system behavior, an effect
which weakens and eventually reverses as proficiency with the system increases.
Creating a paradigm that requires inferential monitoring for effective failure detection
would provide evidence that a more effective mental model can assist in the detection
process. However, such a finding would not necessarily exclude a general vigilance
explanation. Therefore, a second failure detection task was added which would represent
the more traditional signal/no signal vigilance task. This failure type was represented by a
bold red indicator surrounding a fuel pump and is analogous to a sub-system indicator light
illuminating in a cockpit. Because indications of the failure are explicit and unpredictable,
detection of this failure type should depend on simple vigilance rather than on any model of
system behavior.

In addition to the two failure types, this experiment employed two different auto-pilot types.
The first type of auto-pilot was the “yoked” type as used originally by Young (1969) and
later Young (1995), in which monitors’ visual stimuli consisted of recorded representations
of controllers’ trials. This method has the advantage of insuring that the
visual stimuli received by both controllers and monitors are identical in all conditions.
However, it has the disadvantage of providing visual stimuli which, in terms of auto-pilot
like behavior, is unrealistic. Thus, effects found could be criticized in terms of validity,
since monitors of real dynamic process control systems typically see the system operated in
an optimally efficient manner. For this reason, a third condition was added that used an
“optimized” auto-pilot which operated the system in a
highly efficient manner so that fuel levels were always within the “safe” areas, and the
throttle always matched the recommended setting.

This experiment represents the first use of this paradigm to test past
findings that past controllers make efficient monitors (Johannsen et al., 1976; Kessel &
Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969). This research
is thus somewhat exploratory in nature. However, it is expected that the findings will again
show that controllers, when compared directly with monitors, are superior at detecting
failures. It is possible, however, that the higher workload resulting from the controlling
task will negate some of the typical benefits of controlling (e.g., hypothesis testing and
heightened mental model activation).
More importantly, however, it is expected that controllers will be more efficient monitors
when compared to individuals who monitor in both conditions. Further, these differences
should appear only in the inferential monitoring task, and not in the simple explicit
detection task. This expectation is a result of the hypothesis that the improved controller
performance observed in the past is due to an activated mental model guiding subjects to
subtle system cues. It is also hypothesized that any differences between controllers and
monitors will be seen in both yoked and optimized auto-pilot conditions, since both auto-
pilot types have in the past shown differences between monitors and controllers.
EXPERIMENT 1
Method
Subjects
Thirty-eight right-handed male university students were used in the experiment. Students
were paid a base rate for their participation in the experiment. Additionally, subjects were
given the opportunity to earn a five dollar bonus for good performance. All subjects had
normal or corrected-to-normal vision.
Apparatus
A 50 MHz Intel 486 PC with a 17 inch color CRT display was used. A spring centered,
dual-axis hand control (CH Products FlightStick) with a finger operated trigger was
connected to the PC via a 12-bit A/D converter. The subjects sat in a cushioned, semi-
reclining chair, with a rest supporting their arm and the “joy stick.” The seating position
yielded an eye-to-display distance of approximately 100 cm. The room containing the
apparatus was darkened, with primary light being provided by a red bulb for the purpose of
minimizing glare on the display.
Task
A discrete, single-dimension tracking task was used in combination with a fuel
management task in the aviation-based simulation (see Appendix A). The display
contained a “pictorial” representation of an aircraft fuel system with tanks in each wing,
two in the front, and two in the rear of the aircraft. Fuel tanks were interconnected with a
series of symbolic fuel lines showing fuel flow direction, and boxes on the fuel lines
represented pumps which were either on or off. The fuel management portion of the task is
similar to the Multi-Attribute Task Battery (Comstock & Arnegard, 1992) fuel management
task used by Parasuraman et al. (1993, 1996). The throttle level and recommended throttle
setting which made up the discrete tracking task were located in the right portion of the
display and the aircraft’s speed was displayed digitally in the nose of the aircraft.
The single-dimension, discrete tracking task required subjects to use the joy stick in order to
match the aircraft’s current throttle level with the “recommended throttle setting” level.
The current throttle setting was indicated by a yellow bar, while the “recommended throttle
setting” was indicated by an adjacent blue bar. Throttle position directly controlled the
displayed speed of the aircraft, which was explicitly displayed in the nose of the aircraft but,
more importantly, throttle position controlled the amount of fuel consumed by the aircraft.
The relationship between throttle position and speed was linear, but the relationship
between speed and fuel consumption was non-linear. Therefore, higher throttle positions
consumed disproportionately more fuel. This non-linear
speed/fuel consumption relationship meant that doubling the speed, for example, resulted in
more than twice the rate of fuel consumption.
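To illustrate the two relationships (the actual functions used in the simulation are not
reported, so the gain, coefficient, and quadratic form below are assumptions):

    def speed_from_throttle(throttle: float, gain: float = 2.0) -> float:
        """Displayed speed is a linear function of throttle position."""
        return gain * throttle

    def fuel_flow(speed: float, coeff: float = 0.01) -> float:
        """Fuel consumption grows non-linearly (here, quadratically) with speed."""
        return coeff * speed ** 2

    # Doubling the throttle doubles speed but quadruples fuel consumption:
    print(fuel_flow(speed_from_throttle(50)))   # 100.0
    print(fuel_flow(speed_from_throttle(100)))  # 400.0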
The fuel management task involved the on/off manipulation of six fuel pumps which
controlled fuel flow between fuel tanks. Subjects manipulated the fuel pumps by toggling
keys on the keyboard which were both mapped to the general layout of the fuel pumps, and
were labeled with a specific fuel pump number. The fuel management task required
subjects to manipulate the fuel transfer pumps in order to keep fuel levels in the four main
tanks at “safe” levels, indicated by yellow bars on the fuel tanks. Subjects were told that
their task was to pump fuel out of the wing tanks and into the front and rear tanks so that
those tanks remained at safe levels.
The task was made more difficult by three subtle features of the system. First, as mentioned
earlier, although fuel depletion from the rear tanks was controlled by the speed of the
aircraft, the relation between aircraft speed and fuel consumption was non-linear, so that
subjects had to pay close attention to the throttle level in order to predict fuel consumption.
Second, the fuel tanks, although pictorially similar in size, had different fuel capacities, so
that a single pump operation had a different effect on the displayed fuel level in each of
the two tanks it connected. Third, the fuel pumps had different flow rates, such that a
pump’s flow rate was contingent upon the location of that pump in the fuel system.
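These subtleties amount to a small configuration table. A hypothetical sketch (all
capacities, rates, and routings are invented for illustration, loosely following the training
examples described below):

    # Pictorially identical tanks with unequal capacities (fuel units):
    TANK_CAPACITY = {
        "left_wing": 3000, "right_wing": 3000,
        "front_1": 1500, "front_2": 1500,
        "rear_1": 2000, "rear_2": 2000,
    }
    # Pump flow rates (units/s) depend on a pump's place in the system:
    PUMP_RATE = {
        "P1": 20, "P2": 20, "P3": 10, "P4": 10, "P5": 15, "P6": 15,
    }
    # Routing: (source tank, destination tank) for each pump:
    PUMP_ROUTE = {
        "P1": ("left_wing", "front_1"),
        "P3": ("left_wing", "rear_1"),
        # ... remaining pumps wired analogously
    }

    print(PUMP_RATE["P1"] / PUMP_RATE["P3"])  # 2.0 — P1 pumps twice as fast as P3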
Two types of failures occurred in the system, each representing a different type of fuel
system failure (see Appendix A). The first type of failure, the signaled pump failure, was
indicated by the symbolic pump border changing from thin white to a highly salient thick
red. Subjects had five seconds to detect this failure. If the failure was detected, or time
expired, the red border returned to white. Subjects were told that a pump failure indicated
a problem with a pump, but that pressing the trigger returned the pump to normal
functioning. A pump failure was totally unrelated to the pump or fuel tank behavior, and
was thus only detectable by the change in the fuel pump border.
The second failure type was the inferential failure, called a “pressurization” failure, and
was indicated by abnormal behavior of fuel levels within the four fuel tanks. A
pressurization failure occurred when the fuel level in one of the four main tanks increased
or decreased in a manner inconsistent with what would be expected given: a) fuel pump
activity and b) rate of aircraft’s fuel consumption. This task was made more difficult by the
subtle system features mentioned previously. Subjects had 16 seconds to detect a
pressurization failure. If the failure was detected, the abnormal fuel flow behavior stopped.
If the failure went undetected during the 16 second failure duration window, the abnormal
fuel flow ceased and the tank level remained at its new level.
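In outline, then, inferring a pressurization failure amounts to comparing a tank’s observed
level change against the change implied by pump activity and fuel consumption. A sketch
of this comparison (my own formulation, with a hypothetical detection tolerance):

    def expected_level_change(inflow_rates, outflow_rates, burn_rate, dt_s):
        """Tank-level change implied by pump settings and fuel consumption."""
        return (sum(inflow_rates) - sum(outflow_rates) - burn_rate) * dt_s

    def is_pressurization_failure(observed_change, inflow_rates, outflow_rates,
                                  burn_rate, dt_s, tolerance=5.0):
        """Infer a failure when a tank level moves in a way that pump
        activity and the aircraft's fuel consumption cannot account for."""
        expected = expected_level_change(inflow_rates, outflow_rates,
                                         burn_rate, dt_s)
        return abs(observed_change - expected) > tolerance

    # A tank drains far faster than its pumps and the burn rate can explain:
    print(is_pressurization_failure(-90.0, inflow_rates=[10.0],
                                    outflow_rates=[0.0], burn_rate=2.0,
                                    dt_s=5.0))  # True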
Experimental Design
Three groups participated in the transfer of training, between-subjects design. The first
group controlled the first day of the experiment, the second group monitored in the
“optimized” auto-pilot condition, and the third group monitored in the “yoked” auto-pilot
condition. On the second day (transfer day), all three groups monitored in both the auto-pilot and
yoked conditions in four 14 minute trials with two counterbalanced trials of each
monitoring condition.
The experimental portion of each day consisted of four 14 minute trials with a two minute
break between each trial. Each 14 minute trial had seven pump failures and seven
pressurization failures. Failure type and failure sequence were randomized, and the time
between failures was between 20 seconds and three minutes. (See Figure 1).
Figure 1. Participatory Mode, Experiment 1.

Group                     Session 1 (Day 1)    Session 2 (Day 2, monitoring)
Controllers               Control              Auto-pilot; Yoked
"Auto-pilot" monitors     Auto-pilot           Auto-pilot; Yoked
"Yoked" monitors          Yoked                Auto-pilot; Yoked
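The failure schedule described above (seven failures of each type per 14-minute trial,
separated by 20 seconds to three minutes) could be generated roughly as follows; the
rejection-sampling approach is an assumption, since the actual randomization procedure is
not reported:

    import random

    def make_trial_schedule(trial_s=14 * 60, n_pump=7, n_press=7,
                            min_gap_s=20, max_gap_s=180):
        """Randomize failure type order and spacing within one trial,
        keeping every inter-failure interval between 20 s and 3 min."""
        failures = ["pump"] * n_pump + ["pressurization"] * n_press
        random.shuffle(failures)
        while True:  # crude rejection sampling; re-draw until spacing is legal
            onsets = sorted(random.uniform(min_gap_s, trial_s)
                            for _ in failures)
            gaps = [b - a for a, b in zip(onsets, onsets[1:])]
            if all(min_gap_s <= g <= max_gap_s for g in gaps):
                return list(zip([round(t) for t in onsets], failures))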
Training
The training consisted of part- and whole-task practice for the first thirty minutes of the first
day. Subjects either received practice controlling or monitoring each component task, then
received practice with the whole system, first with performance feedback, then without.
After the practice session subjects were instructed to ask the experimenter if they had any
questions about the task, and all questions were answered. Subjects were also given ten
additional minutes of monitoring training at the beginning of the second day. All subjects
saw both auto-pilot types. Subjects were told that during the experiment they would be
responsible for detecting both types of system failures.

Considerable emphasis was put on how the system operated in terms of its “structure and
processes” (Kieras & Bovair, 1984). The mechanics of the system were explicitly explained
(e.g., “Pump P1 controls fuel flow from the left wing tank to the front fuselage tank.”), the
subtleties of system behavior were explained (e.g., “Pump P1 has twice the fuel pumping
capacity as pump P3.”), and the concept of the system was explicitly explained (e.g.,
“airplanes are sensitive to the location of weight, therefore making it important that fuel be
distributed properly throughout the aircraft.”).

This was done so that subjects developed a complete mental model of the system, as
emphasized in research comparing operators with and without mental, or “device,” models
(Kieras & Bovair, 1984). Although the training received by controllers and monitors was
different in the specific level of control, every effort was made to insure that all other
elements of the training (e.g., training time and level of explanation of the dynamic system)
were identical.
Results
Between- and within-subjects comparisons were made using signaled failure reaction time
(RT) and a combined RT and error rate measure for inferential failures. Analyses of
variance (ANOVA) were used to test for group differences and interactions for both
signaled and inferential failures. The combined performance measure for inferred RT was
used for the purpose of managing between-subjects variability common with complex
dynamic task performance (Parasuraman, 1986). Further, as discussed in the next section,
its use was based on precedent (Kessel & Wickens, 1982; Wickens & Kessel, 1979-1981).
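For concreteness, each between-group comparison reported below corresponds to a one-way
ANOVA on per-subject scores. A minimal sketch with invented data, using scipy (one
standard way to compute such a test; not the original analysis software):

    from scipy.stats import f_oneway

    # Hypothetical per-subject mean scores for one condition (not real data).
    controllers = [2.48, 2.55, 2.51, 2.60, 2.53]
    autopilot   = [2.70, 2.66, 2.72, 2.61, 2.68]

    f_stat, p_value = f_oneway(controllers, autopilot)
    print(f"F(1, {len(controllers) + len(autopilot) - 2}) = "
          f"{f_stat:.2f}, p = {p_value:.3f}")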
The use of an efficiency index is based on the assumption that, “subjects aggregate evidence
over time concerning the discrepancy between the sampled-system behavior and the
internal model of a non-failed system, until this evidence exceeded an internal decision
criterion. Detection efficiency is reflected in the rate of aggregation of internal evidence,
independent of the criterion setting” (Wickens & Kessel, 1980, p. 569). Therefore, because
efficient detection is both fast and accurate, it should be reflected in an index integrating
both measures.
Although indexes used in earlier research have been described as “somewhat arbitrary,”
(Wickens & Kessel, 1980, p. 569), every effort was made here to remove the arbitrary nature
of the weighting method, while still combining the measures and reducing overall RT
variability. Therefore, it was decided that a weighting scale would be used in which RT
was divided into either fast (<8000 ms) or slow (≥8000 ms), since eight seconds was very
close to the grand mean for the experiment. “Fast” RTs were scored as 1, RTs which were
“slow” were scored as 2, while misses were scored as 3. This created an ascending
combined RT/error rate scale in which optimal performance generated a 1 (every failure is
detected in less than 8 seconds), and bad performance received a 3 (every failure event was
missed). By using only three consecutive levels in the index, misses were appropriately
weighted as significantly worse than long “hits.” While the 16 second “hit” window is
somewhat arbitrary, it was considered acceptable because inferred failure detection in a real
task would be highly task and failure dependent. Further, detection performance in this
task is based on a continuum, so the actual window size is not particularly meaningful.
However, if this paradigm were based on a real operational task with real failures, this “hit”
window would take on considerable meaning. It was believed that this index successfully
reduced variability, yet was far less arbitrary than other weighting methods. The raw RT
data are presented in Appendix B.
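The scoring rule reduces to a few lines. In this sketch the cutoffs come from the text,
while the encoding of a miss as None is my own convention:

    FAST_CUTOFF_MS = 8000    # near the grand-mean RT for the experiment
    HIT_WINDOW_MS = 16000    # responses after this count as misses

    def score_inferential_failure(rt_ms):
        """Map one failure event onto the index: 1 = fast hit,
        2 = slow hit, 3 = miss."""
        if rt_ms is None or rt_ms > HIT_WINDOW_MS:
            return 3
        return 1 if rt_ms < FAST_CUTOFF_MS else 2

    def combined_index(rts):
        """Per-subject combined RT/error score: 1.0 is optimal, 3.0 worst."""
        scores = [score_inferential_failure(rt) for rt in rts]
        return sum(scores) / len(scores)

    print(combined_index([4200, 9100, None, 6500]))  # (1+2+3+1)/4 = 1.75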
Signaled Failures
Simple RT findings were generally contrary to expectation (see Figure 2). The yoked
monitoring group was marginally faster at detecting simple failures than was the controller
group (974 vs. 1172 ms) when compared directly [F(1,24) = 3.3, p < .1] in Session 1. The
optimized auto-pilot group was not significantly faster than the controllers (1120 vs. 1172
ms) nor significantly different from the yoked group (1120 vs. 974 ms).
In the transfer condition (Session 2), the yoked group was marginally faster than the
controllers (873 vs. 1103 ms), [F(1,24) = 2.01, p < .15]. Although this effect is weak, it is
reported because it is highly contrary to expectation. The auto-pilot group was also quicker
than the controllers (939 vs. 1031 ms), although this effect was not significant. The
difference between the yoked group and the optimized auto-pilot group (873 vs. 939 ms)
was also not significant.

Figure 2. Signaled failure RT (ms) for the Controller, Auto-pilot, and Yoked groups on
Day 1, Day 2 Auto-pilot, and Day 2 Yoked trials.
Inferential Failures
Figure 3 shows inferential failure detection results. Controllers were significantly better at
detecting inferential failures than the optimized auto-pilot group in Session 1, (2.535 vs.
2.676), [F(1,26) = 4.43, p < .05], but not significantly better than the Yoked group, (2.535
vs. 2.65). The optimized Auto-pilot group was not significantly different from the Yoked
In the transfer condition, in which all subjects performed both of the auto-pilot tasks,
controllers did not perform significantly better than either of the auto-pilot groups, although
all means were in the anticipated direction. The Controllers, when compared to the
optimized Auto-pilot group, were not significantly different (2.619 vs. 2.699), nor were the
Controllers different from the Yoked group when compared on the yoking task, (2.577 vs.
2.593), [F(1,25) = .58]. Interestingly, the Yoked group was better than the Auto-pilot
group at the Auto-Pilot task in the transfer condition (2.66 vs. 2.7), although this difference
was not significant.
Figure 3. Inferential failure detection performance (combined index) for the Controller,
Auto-pilot, and Yoked groups on Day 1, Day 2 Auto-pilot, and Day 2 Yoked trials.
Discussion
The results of Experiment 1 only partially supported the experimental hypotheses, yielding
both expected, and unexpected findings. The finding most consistent with previous
research was that subjects who controlled were significantly better at detecting inferential
failures than were the Auto-pilot monitors, and marginally better than Yoked monitors in
Session 1. This finding, although consistent with past research showing that controllers,
when compared directly with monitors, are better at detecting failures, was not entirely
predicted from the hypothesis given that the higher workload levels present when
controlling could have interfered with failure detection. In this particular task, the
proprioceptive feedback available to Controllers was only indirectly related to failures. In
the experiments using tracking tasks, however,
proprioceptive feedback was a direct indication of system failure and therefore a highly
salient cue. Proprioceptive feedback is therefore not considered a distinct advantage for
Controllers in this paradigm.

Subjects could “hypothesis test” in a failure condition as in past research using tracking
tasks, and this may have been a distinct advantage for Controllers. When Controllers
sensed illogical system behavior, they could test their hypothesis through pump or throttle
manipulation to see if their own inputs resulted in continued illogical system behavior.
Post-experiment interviews suggested that some Controllers did use this strategy when
detecting failures. In addition, although the Controllers’ overall failure detection
performance was marginally better than the Yoked group, there was a slight and
nonsignificant speed-accuracy trade-off in Session 1 (see Appendix B), which may have
been a result of Controllers taking the extra time to hypothesis test prior to signaling a
failure, causing their RT to be slightly greater and their accuracy significantly better. It is
also possible that Controllers took advantage of a more activated mental model of the
system, and were thus more sensitive to illogical system behavior. However, when
considered alone, this finding says little about the activity of the
operator’s mental model, given the other possible explanations for this advantage.

Reaction times for the signaled failures were generally consistent with the hypotheses.
Because signaled failure detection requires no inferential
activity in this experimental paradigm, it is likely that signaled failures are an effective
measure of workload. In fact, signaled failure RT was marginally faster for Yoked
monitors than Controllers, probably reflecting the lower workload levels for the Yoked
group. It is also possible that the greater latency for Controllers may have been the result of
subjects’ need to scan a greater portion of the display in order to perform the sub-task of
matching the throttle with the recommended level. Thus, it is possible that the shorter RTs
of the Yoked group were because subjects spent more time focused in the center of the
display where both failure types occurred, rather than switching their focal point to the
throttle display area on the periphery of the display. Although the Auto-pilot monitors also
had lower workload levels than the Controllers (and perhaps even lower than the Yoked
monitors), their signaled failure RTs were not significantly faster than the Controllers’.
This may reflect the generally weak performance observed in
all conditions associated with the auto-pilot monitoring task. Although not central to this
research, the weak auto-pilot performance will be discussed further in the following
paragraphs.
Results from Session 2 did not generally support the hypothesis that system controllers are
better monitors than subjects who monitored in Session 1. Although group means were in
the predicted direction, there were no significant differences between the Controllers and
the Auto-pilot or Yoked monitors. Controllers were slightly better than the Auto-pilot
group when transferring to the yoked condition, but this difference was marginally
significant at best (p < .13). Further, this result is more likely a result of very poor
performance by the Auto-pilot group in the Yoked condition, rather than good performance
by the Controllers. The only significant finding for Session 2 was that the Yoked group
performed better than the Auto-pilot group when transferring to the yoked condition. This
finding was not surprising given that the Yoked group had previous experience with the
yoked condition and the Auto-pilot group did not. However, one would expect the opposite
pattern in the optimized auto-pilot condition, and it did not emerge.
There are several possible explanations for the Controllers’ failure to perform significantly
better on inferred failure detection tasks than the two monitoring groups in Session 2.
While past tracking task research used two days for Session 1, not including training
(Kessel & Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995), I believed that the
cognitive nature of this paradigm would allow it to be learned more quickly than the subtle
motor skills required in a difficult tracking task. This assumption was incorrect, however,
as post-experiment interviews and experimental data suggested that the task was actually
quite difficult to learn and perfect, and that the training time had not been sufficient for
subjects to master the task. In fact, some subjects suggested that they were still learning the
task well into Session 2. Further complicating this picture is the fact that the high
workload in the controlling condition (in both training and during Session 1) may have
made it more difficult for Controllers to learn the task as compared to the two monitoring
groups. Given that the group means were in the predicted directions, it is possible that the
Controllers did have an advantage in detecting inferential failures but, because of the
learning issues, this difference was not strong enough to generate a significant effect.
The experimental hypotheses stated that there would be no effect of prior experience for
signaled failure reaction times in Session 2. This prediction was based on the theory that
any advantage during monitoring afforded to past controllers was due to a more activated
mental model, and thus would not affect signaled failure detection performance. However,
RTs for signaled failures were marginally affected by condition. The Yoked monitor group
was faster (at a marginally significant level) than the Controller group while performing the
yoked monitoring task. The Auto-pilot monitors were also faster than the Controllers,
although not significantly. Although the difference between Controllers and Yoked
monitors was not significant, it is unexpected and therefore quite interesting, and will be
explored further in the subsequent experiment. The most salient explanation for this
finding seems to be that Controllers scan the display more diligently than the Yoked
monitors, and therefore spend less time focused on the center of the display where the
signaled failures occurred.

The theory that system controllers scan more effectively is further supported by the
possibility that Controllers performed better on the inferential failure detection task. This
improved failure detection performance could have been the result of Controllers
integrating subtle cues from the system more effectively and therefore being more sensitive
to system abnormalities. Importantly, this integrating process would likely use information
from the throttle display in forming a diagnosis. It therefore seems that if Controllers are
more sensitive to the system operations as an integrated unit, they spend more time focusing
on the throttle display, accessing the important throttle information, and less time focused
on the center of the display.
Implications of this finding are that subjects who have controlled, and are likewise
benefiting from controlling experience while monitoring, seem to be spending more time
scanning the display for useful information. Given that the throttle provides subtle clues
about system behavior, the Controllers should have derived a failure detection advantage if
they were allocating more time to studying its impact on the system. However, the
Controllers were not significantly better on the inferential failure detection task than the
other groups. This implies that the information provided by the throttle was not valuable
enough to improve inferential failure detection performance for those who observed it. It is
possible that while throttle information may have been advantageously used by Controllers,
its benefit was not large enough to produce reliable group differences.
This explanation may provide some clues as to why accidents involving controlled flight
into terrain with auto-pilot engaged were not detected by the pilots even though there was
ample evidence of impending disaster. This finding also suggests that if one of the benefits
of controlling is more effective scanning, then scanning behavior
may be diagnostic of the effects of the “out of the loop” performance problem suggested by
Smolensky (1993). It is also possible that Controllers, because of their extensive experience
with controlling the throttle, gained a greater understanding of the relationship between the
throttle and fuel system behavior, and were hence more inclined to observe throttle activity
even when monitoring the system. This explanation supports the contention that active
controllers scan more while monitoring because they have developed a different failure
detection strategy, rather than because their mental model is more activated.
As mentioned previously, two different monitoring groups were used to address concerns
that the experimentally superior “yoked” auto-pilot method may induce differences in
monitoring behavior
compared to an “optimized” type auto-pilot. However, even the optimized automation is
completely task dependent. The optimized system for this task was based on the view that
aircraft automation is extremely consistent and rigid in the way it controls the various
systems. Therefore, the optimized system for this task consistently held fuel levels in the
“safe” zones, and operated the pumps in a rigid operational sequence to maintain correct
fuel levels. Additionally, the throttle setting was automatically maintained at the level of
the recommended setting.
Results from the experiment suggest that the hypothesized effects are not unique to the
“yoked” condition. In fact, in nearly all conditions the performance of the “optimized”
Auto-pilot group was worse than the Yoked group. This suggests that the effects found in
this paradigm are not due to the use of the yoked methodology. In fact, results obtained
using this method may underestimate effects found in applied settings due to the prevalence
of “optimized” automation in operational systems. Since this difference is not the focus of
this research, nor was it consistently significant, it will not be explored further. However, it
may be worth noting that the highly consistent behavior of the optimized system may have
had a numbing effect on subjects, thus pushing them even farther out of the control loop, or
so they perceived, and reducing mental model activation even further. In addition, it is also
possible that the consistency of the automation made them believe the task was easier,
compared to the Yoked monitors, who had to pay close attention to the automated system in
order to follow its less predictable behavior.
This view is supported by the fact that in Session 2 the Auto-pilot group had marginally
poorer performance on inferred failure detection tasks in the yoked condition, compared to
Controllers, and significantly worse performance than the Yoked monitors. Thus, when
transferring to the more difficult yoked monitoring task, the optimal Auto-pilot group was
at a further disadvantage as a result of their experience in the highly predictable optimized
auto-pilot condition. Further, there are indications that the Yoked group performed better
at the optimized auto-pilot monitoring task than the Auto-pilot group, even though this
condition was novel to them. This suggests that some feature of the yoked monitoring task
made Yoked monitors more sensitive to system behavior, thus giving them an advantage
even in novel monitoring conditions.
Experiment 1 Conclusions
The results of this experiment were informative and suggestive. Controllers, when
compared directly with monitors in Session 1, were better at detecting
inferential failures. This finding likely reflected “hypothesis testing” and perhaps improved
mental model activation. Signaled failure detection showed the
opposite results, reflecting the higher workload of controlling, and the necessity for
Controllers to observe the throttle for control purposes, resulting in less time spent focused
on the center of the display. Group
inferential failure detection means were in the predicted directions during Session 2,
although these differences were not significant. This finding suggests that a transfer effect
for Controllers may exist, but the experiment as conducted lacked power. The results of
signaled failure performance in Session 2 were surprising, perhaps reflecting the fact that
subjects with experience controlling scan the display more effectively, supporting previous
contentions that mental models play a role in guiding perceptual activity (Endsley, 1995).
However, this difference could also be due to Controllers developing a failure detection
strategy more dependent on the effect of throttle behavior. In either case, Controllers seem
to spend more time focused on the throttle while monitoring, and less time focused on the
center of the display.
The primary goal of Experiment 1 was to replicate earlier findings that controllers of a
dynamic system are better at detecting system failures than subjects who only monitor when
both are transferred to a monitoring task. A related goal
was to replicate these findings using a cognitively complex dynamic system management
paradigm. Although the experiment yielded interesting results, the primary objective was
not fully met.
Experiment 2 was designed to both correct the weaknesses of Experiment 1, and to further
explore the surprising findings from the signaled failure detection task. In addition, the
“optimized” auto-pilot condition was dropped from Experiment 2, since the predicted effect
seems to be present using both auto-pilot types and the yoking methodology is
experimentally superior.
The first change was the addition of an
additional day for Session 1. The extra day was added to address the anecdotal and
experimental evidence suggesting that subjects were still learning the task well into the
second day (Session 2). I am interested in the transfer effects from a well learned task, and
it is therefore imperative that the task be well learned before subjects switch to the transfer
task. In addition, some subjects suggested in post-experiment interviews that they were
confused by the triggering system and the consequent lack of performance feedback, and
that this confusion further hampered their ability to quickly learn the task. Subjects
indicated that because of the subtle nature of the inferential failures, even though the failure
behavior ceased after detection, it wasn’t always clear if a failure had been successfully
signaled. This confusion was exacerbated by the fact that a “false alarm” deactivated the
trigger, so that when subjects positively identified a failure in the same trial, trigger
activation had no effect, leading them to believe that they had improperly diagnosed the
failure.
To address this confusion, a message system was added to the display to inform subjects of
both the state of the trigger (armed or not) and whether or not they had correctly identified
a failure. Not only did this procedural change augment the performance information that
subjects generally assumed on their own but, more importantly, prevented any false
learning resulting from system state misinterpretation. Although this feedback could be
criticized on the grounds that better performers would receive more positive feedback, this
method afforded users the opportunity to learn from both correct and incorrect performance.
Although there was implicit feedback in Experiment 1, it favored individuals with better
performance to an even greater degree, since good failure detection performance likely
meant better system understanding. Therefore, improved system understanding not only
led to better performance, but also more accurate interpretation of implicit system feedback.
The intentional system subtleties included in Experiment 1 were carried over into
Experiment 2, but were exaggerated somewhat to further occlude the inferential failures.
Failure onset was made more subtle, and pump flow-rate differences were exaggerated
slightly. Most importantly, the non-linear relationship between the throttle level and the
rate of fuel consumption was exaggerated, and the recommended throttle level changed
positions at a greater frequency, making throttle level monitoring (and controlling) more
demanding. This procedure was used because in Experiment 1 Controllers may have been
spending more time scanning the throttle display while monitoring. If knowledge of
throttle activity is now more important for inferential failure diagnoses, then any scanning differences between the groups should be reflected in both inferential and signaled failure detection performance.
To test the hypothesis that Controllers may have poorer signaled failure detection
performance because they spend more time scanning the display for throttle information,
the throttle information was both moved farther to the edge of the display and made slightly
less salient. Both changes were made to increase the time required to effectively scan the
throttle information. This change should exaggerate the signaled failure detection deficit expected for Controllers, making it easier to observe this difference.
To further explore this issue, the throttle was removed from the display on half of the trials in Session 2. If the Controllers’ signaled failure detection deficit is a result of differences in scanning behavior, then their signaled failure detection performance should improve when the throttle is absent. Conversely, if Controllers are using the throttle information to facilitate inferred failure detection, then the removal of this information should reduce their inferential failure detection advantage.
Experimental Hypotheses, Experiment 2
The hypotheses for Experiment 2 both restate those of Experiment 1 and further explore unexpected findings from Experiment 1. I expect the direct comparison
between Controllers and Monitors in Session 1 to again show a small advantage for
Controllers in the inferential monitoring task, as a result of hypothesis testing and perhaps
improved mental model activation, but a disadvantage in the signaled failure detection task
due to higher workload and the need to spend more time focused on the throttle display.
Controllers should also show an advantage over Monitors in the inferential failure detection
task in Session 2, supporting the hypothesis that the heightened activation of the
controllers’ mental models makes them more sensitive to inferential failures when monitoring.
However, this advantage in inferential failure detection may only be present in the “throttle
visible” condition (see Method). If the activated mental model guides perception (Endsley,
1995), and attention is thus directed to the throttle information on the display because it
provides relevant data for inferring abnormal operation, the absence of throttle information
should impair the failure detection advantage of Controllers. Further, if the poor signaled failure detection performance of Controllers is a result of their scanning behavior, then the “throttle not visible” condition in Experiment 2 will show no such Controller disadvantage.
EXPERIMENT 2
Method
The Methods section for Experiment 2 highlights only differences from Experiment 1.
Subjects
Students enrolled in Psychology courses were used in the experiment. Students received “experimental credit”
and were paid a base rate for their participation in the experiment. Additionally, subjects
were given the opportunity to earn a five dollar bonus for good performance.
Task
The task for Experiment 2 was the same as that used for Experiment 1 except for the
following changes: A trigger and performance feedback message was added to the lower
right corner of the display to address the confusion about system state expressed by subjects
in Experiment 1. “Trigger Armed,” “False Alarm, trigger INOP until reset,” and “Failure detected” messages kept subjects apprised of the system state. In addition, the messages were color coded to heighten awareness of state changes.
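To make the trigger logic concrete, the following minimal sketch captures the message behavior described above. The class and method names are hypothetical; the actual experiment software is not reproduced in this text.

    class FailureTrigger:
        """Sketch of the Experiment 2 trigger/feedback messages (hypothetical names)."""

        def __init__(self):
            self.armed = True

        def fire(self, failure_active):
            # A false alarm deactivates the trigger until the next reset, so
            # subsequent activations cannot signal a failure.
            if not self.armed:
                return "False Alarm, trigger INOP until reset"
            if failure_active:
                return "Failure detected"
            self.armed = False
            return "False Alarm, trigger INOP until reset"

        def reset(self):
            # Re-arm the trigger (e.g., at the start of a new trial).
            self.armed = True

        def status(self):
            return "Trigger Armed" if self.armed else "False Alarm, trigger INOP until reset"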
Two changes were made to the throttle portion of the display. First, the throttle was moved
farther toward the upper right-hand corner of the display and the “recommended throttle
position” was made less salient by decreasing the width of the indicator bar. Both of these
changes were made to increase the time required to scan the throttle-setting portion of the
display. In a related change, the digital aircraft speed was moved from the forward-center
location of the aircraft to the upper left-hand corner of the display. This was done to further
increase the time needed to effectively scan all information components of the display. The
second major change was the removal of all throttle information on half of Session 2 trials.
This change eliminated the need for subjects to scan the periphery of the display, but also
removed information which may have helped them in the inferential failure detection
process.
In order to further occlude normal system operation and thus complicate the inferential
failure detection process, individual pump flow rate differences were exaggerated, the
linearity of the throttle level/fuel flow ratio was degraded, and inferential failures
themselves were made slightly harder to detect. The final change to the task for
Experiment 2 was that the time given to detect a pump failure was reduced from 5 seconds
to 3.5 seconds because the results of Experiment 1 suggested that the extra 1.5 seconds was
unnecessary.
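As a purely illustrative sketch of the kind of non-linearity described above (the actual throttle-to-fuel-flow mapping used by the task software is not specified in this text, and k and gamma below are hypothetical parameters):

    def fuel_flow(throttle, k=1.0, gamma=1.6):
        """Hypothetical non-linear throttle-to-fuel-flow mapping; k and gamma
        are illustrative, not values from the task software."""
        return k * throttle ** gamma

Exaggerating the non-linearity (e.g., increasing gamma) makes fuel consumption harder to predict from the throttle setting alone, which is precisely what complicates inferential failure detection.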
Experimental Design
Two groups participated in the transfer of training, between-subjects design. The first
group controlled the system during the first and second days of the experiment (Session 1)
while the second group monitored a “yoked” auto-pilot during Session 1. On the third day
(Session 2), the transfer condition, both groups monitored the yoked condition and detected
failures. However, in two of the four trials, the throttle information was eliminated from the display, as summarized below.

Participatory Mode, Experiment 2

Group               Session 1 (Days 1, 2)   Session 2 (Day 3, Monitoring)
Controllers         Control                 Throttle Visible; Throttle NotVisible
"Yoked" monitors    Monitor                 Throttle Visible; Throttle NotVisible
Training
The training session was 30 minutes at the beginning of Day 1, and was identical to that of Experiment 1.
Results
Between- and within-subject comparisons were made for both signaled failure RT and the combined RT and inferential failure error rate measure used in Experiment 1. An analysis of variance (ANOVA) was used to test for group differences and interactions.
Session 1 data were from Day 2 only unless otherwise specified, as Day 1 was treated as
learning. The raw RT and error rate data for Inferred failures are provided in Appendix C.
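As an illustration of the between-groups comparisons reported below, this minimal sketch shows how a two-group F test of mean signaled failure RT could be computed. The data and tooling here are hypothetical and are not those used in this research.

    from scipy import stats

    # Hypothetical per-subject mean signaled-failure RTs (ms); not the
    # experiment's data.
    controller_rt = [835.0, 812.0, 860.0, 798.0, 841.0]
    monitor_rt = [794.0, 810.0, 775.0, 802.0, 788.0]

    # With two groups, a one-way ANOVA is equivalent to the F(1, N - 2)
    # group comparisons reported in this section.
    f_stat, p_value = stats.f_oneway(controller_rt, monitor_rt)
    print("F(1,%d) = %.2f, p = %.3f"
          % (len(controller_rt) + len(monitor_rt) - 2, f_stat, p_value))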
Signaled Failures, Session 1
Consistent with the results of Experiment 1, subjects were still learning the task into the second day, as
demonstrated by the significant main effect of Day for mean reaction time from Day 1 to Day 2,
[F(1,36) = 13.6, p < .01]. Although the Group by Day interaction was not significant,
Controllers’ improvement was larger from Day 1 to Day 2 in Session 1 (960 vs. 835),
[F(1,17) = 14.45, p < .01], than Monitors’ (867 vs. 794), [F(1,19) = 3.09, p < .1]. A simple
comparison between Controllers and Monitors in Session 1 (Day 2) was in the predicted
direction but was not significant (835 vs. 794), [F(1,37) = .27].
Signaled Failures, Session 2
Session 2 yielded surprising findings for signaled failures. There was a main effect
favoring Controllers over Monitors, [F(1,36) = 4.75, p < .05], and, as shown in Figure 5, a Group by Visibility interaction that approached significance [p < .15]. There were no significant group differences in the throttle Visible condition (667 vs.
728), [F(1,37) = 1.21], but there was a significant difference in the throttle NotVisible
condition (604 vs. 757), [F(1,37) = 6.62 , p < .05]. As expected, Controllers improved
from the throttle Visible to the throttle NotVisible condition (667 vs. 604), [F(1,17) = 6.64,
p < .05], while the Monitors’ mean RT increased, but not significantly (728 vs. 757). (See
Figure 5.)
Figure 5. Signaled failure RT for Controllers and Monitors across Session 1 (Days 1 and 2) and Session 2 (throttle Visible and NotVisible conditions).
Inferred Failures
There was a significant effect for Day in Session 1 favoring Day 2 [F(1,36) = 15.2, p <
.01] with no significant Day by Group interaction, supporting the contention that both
groups were still learning the task after the first day. This is also supported by the false
alarm data which showed a significant reduction by day, [F(1,36) = 20.4, p < .01], and no
interaction.
Controllers had a lower mean (better performance) than did Monitors in Session 1 (Day 2),
but it was not significant [F(1,37) = .18]. As in Experiment 1, there was a slight non-
significant speed/accuracy trade-off in this condition (see Appendix B), favoring better accuracy for Controllers. In Session 2, Controllers detected inferred failures significantly better than monitors, but only in the throttle Visible condition (1.9 vs. 2.15), [F(1,37) = 4.19, p < .05]. The mean performance score for Controllers was better than the Monitors, but not
significantly (2.01 vs. 2.04), [F(1,37) = .1]. Although there was no group effect favoring
Controllers over Monitors by condition, there was a marginally significant interaction (see
Figure 6), [F(1,36) = 3.67, p < .1], resulting from Controllers having poorer performance
in the throttle NotVisible condition compared to the throttle Visible condition (1.9 vs.
2.01), [F(1,17) = 2.35, p < .15], while Monitors performed better in the throttle NotVisible
condition, although this difference was not significant (2.15 vs. 2.04), [F(1,19) = 1.55].
Figure 6. Inferred failure index for Controllers and Monitors across Session 1 (Days 1 and 2) and Session 2 (throttle Visible and NotVisible conditions).
Discussion
The results of Experiment 2 strongly support the experimental hypotheses, with few exceptions. The significant improvements in performance from Day 1 to Day 2 in Session 1 support the belief that Experiment 1 subjects either had not learned the task or were not proficient at the task by the end of Day 1. Further, these data are especially noteworthy given the additional feedback provided to subjects in Experiment 2, which likely facilitated
task acquisition.
In Experiment 1, Controllers were marginally slower than Monitors at detecting signaled failures in Session 1; these differences were not significant in Experiment 2, although means in both experiments were in
the same direction. This may be a reflection of the fact that by Day 2 workload levels were
probably more similar between the two groups than in Experiment 1, as the additional
practice afforded by Day 1 may have reduced the workload levels for Controllers on Day 2.
This contention is based on the premise that workload in Session 1 in Experiment 1 was a
result of both having to learn to detect failures and learn how to control the system, in
addition to manually controlling the system, the latter two tasks not being applicable to
system monitors. However, in Experiment 2, much of the learning had already taken place,
leaving workload differences between the two groups a result only of the need to manually control the system.
Session 2 signaled failure detection performance supported the hypothesis that Controllers
scan the display more effectively than do the Monitors. Two features of the signaled failure
detection performance support this contention. First, and most importantly, is the fact that
there is a significant difference for Controllers between the throttle Visible and throttle
NotVisible condition, yet there is no such difference for Monitors. This is supported by
both the within-subjects’ comparisons and the marginally significant group interaction of
throttle visibility. This finding suggests that the Controllers scanned the peripherally-
located throttle information to facilitate inferential failure detection when the throttle was
present on the display. This scanning of the throttle information necessarily meant a cost in
RT for detecting signaled failures. Thus, in the throttle NotVisible condition, the
Controllers did not have the option of scanning the peripherally located throttle
information, and their signaled failure RT decreased significantly because attention was
focused only in the center of the display. Likewise, the Monitors’ signaled failure detection performance was unaffected by the removal of the throttle information, suggesting that there was little, if any, attention allocated to it when it was present.
Surprisingly, there was a significant group effect for signaled failure RT, and a marginally
significant interaction. This finding is contrary to the marginal effect found in Experiment
1 in which Controllers were slower than Yoked monitors. However, in the equivalent
throttle Visible condition in Experiment 2, there was no significant difference. This leads
to the speculation that the Experiment 1 finding was a statistical artifact. However, in the
throttle NotVisible condition, the Controllers were significantly faster than the Monitors,
leading to the significant group effect. This finding is contrary to expectations, as the
hypothesis was that these two groups should have performed similarly on the signaled
failure detection task, as both groups were focused similarly in the center of the display.
Although this finding is problematic for the hypothesis that the controller advantage while
monitoring is due to a higher activation state of the subject’s mental model, there are two
likely explanations which are consistent with the theory. The first is that Controllers
benefit from a more activated mental model of the system and that this activation not only
enhances their ability to perceive, integrate and analyze features of the task with greater
efficiency, but spills over such that even simple stimuli are perceived and responded to
more efficiently. The second possible, but less likely, explanation is that Controllers were
frustrated by the lack of throttle information in the throttle NotVisible condition and were
thus channeling extra effort into the task. While this extra effort did little to enhance
inferential failure detection, it did result in significantly better signaled failure detection
performance.
The pattern of outcomes for inferential failure detection conformed to the experimental
hypotheses. As with signaled failure detection performance, there was a significant effect
of Day in Session 1, with no group interaction. This reflected the fact that both Controllers
and Monitors were still learning the task into the second day. In Day 2 of Session 1, mean
performance for Controllers was better than the Monitors, but this difference was not
significant. This finding reflects a consistent trend in this paradigm: when Controllers and Monitors are compared directly, there is a slight speed/accuracy trade-off, with accuracy in favor of Controllers. This is likely a result of Controllers taking the time to
manipulate the system in order to “hypothesis test.” While hypothesis testing generated
more accurate performance, there was some cost in RT. However, none of these differences
(reaction time or accuracy) were significant when compared directly. The fact that both
Experiment 1 and Experiment 2 generated the same trade-off in Session 1 implies that this
is a true effect. Further, Wickens and Kessel (1979) found the same trade-off when comparing controllers and monitors directly. Controllers had slightly higher workload than Monitors in Session 1, which seems not to
have an effect on the Controllers’ inferential failure detection performance. This finding
further supports Wickens and Kessel (1980) who found that workload resulting from
manual response organization and execution (e.g., manual tracking) may not compete with the perceptual and cognitive resources required for failure detection.

While results from tracking-task experiments suggest that the proprioceptive feedback from
tracking improves performance for Controllers, such direct feedback about system behavior,
specifically system failures, was not available proprioceptively to Controllers in the fuel
management paradigm. However, response related information might result from the act of
controlling the throttle and manipulating the fuel pumps, thus instantiating the “state” of
the system for Controllers. While this response information is certainly not as diagnostic
about system state as the proprioceptive feedback from tracking, it may serve a similar role
in updating the operator’s mental model of system activity (i.e., the dynamic execution of
the operator’s mental model), thus off-setting any performance deficits due to higher
workload. There are thus two non-competing explanations for the lack of effect of higher
workload for controllers. Either the resources required to control the system are different from those required to detect subtle inferential failures, or the information obtained or
reinforced from the act of controlling made the task of detecting failures easier, and
therefore more resource efficient even though the resources were the same.
Results from Session 2 supported the experimental hypotheses and successfully replicated
previous findings of Controller superiority in the monitoring task. In the throttle Visible
condition, the Controllers had significantly better inferential failure detection performance
than Monitors. While this finding supports the hypothesis that controlling a system causes
one to be a more effective monitor of inferential failures, it is made more diagnostic by the
fact that no such advantage for Controllers exists in the throttle NotVisible condition.
There was no significant difference between Controllers and Monitors in the throttle
NotVisible condition, and the ability of Controllers declined slightly from the Visible to the
NotVisible condition. There was also a marginal Group by Visibility interaction, reflecting the differential effect of throttle visibility on the two groups.
While the intent of the throttle visibility manipulation was to affect signaled failure
detection performance, which it did, it was unknown whether the removal of throttle
information would actually hinder performance. In theory, the throttle display provides
information which is useful, but not critical, in diagnosing inferential failures. However,
data from Experiment 1 seemed to indicate that while Controllers were focusing more on
throttle information than Monitors, it did little to help them in detecting inferential failures.
However, in Experiment 2 the throttle mechanism was altered to make it a more valuable
information component in the detection of inferential failures. It appears that this change,
in combination with the increased proficiency gained from the additional day in Session 1,
caused individuals who focused more on the throttle information to have a distinct advantage in detecting inferential failures.
Importantly, the effect of throttle visibility suggests that scanning the throttle information
was the critical behavior that enhanced Controller performance. This finding is easily
interpreted through Endsley’s (1995) view of the role of the well developed, or highly
activated, mental model of the behavior of a particular system. Endsley (1995, p.43)
suggests that this model, “provides (a) knowledge of the relevant elements of the system
that can be used in directing attention and classifying information in the perception process,
(b) a means of integrating the elements to form an understanding of their meaning, and (c)
a mechanism for projecting future states of the system based on its current state and an
understanding of its dynamics.” Viewed in the context of the current dynamic execution
theory of mental models, these data can be interpreted to suggest that the Controllers,
because of their activated mental model, direct attention to the throttle mechanism, given its
diagnostic importance in detecting failures, and then successfully integrate that perceptual
information with other momentary system attributes to successfully detect failures. When
the throttle information is not visible, this perceptual and computational advantage goes
unused, as is indicated by the non-significant performance difference between Controllers
and Monitors in the throttle NotVisible condition. Although Controllers may have had
some advantage, as seen in the mean difference favoring Controllers, this advantage was not statistically reliable.
Experiment 2 successfully supported the hypothesis and replicated findings that controllers
are better at detecting failures when transferring to a monitoring task than subjects who
monitor in both conditions. Further, the hypothesis that controllers may scan the display
more in an attempt to perceive task-relevant stimuli was also supported by the fact that Controllers were slower to detect centrally located failures when relevant system information was present in the periphery of the
display. In addition, it appears that the Controllers not only scanned the display for
information, but they perceived and integrated it more efficiently than monitors and were
thus more effective at detecting inferential failures. The only surprise was that Controllers
were, on the whole, better at detecting signaled failures than were system monitors, suggesting that there may be some carry-over effect from an activated mental model which is only beginning to be understood.
Several practical and theoretical implications can be drawn from these findings. Most
importantly, the transfer advantage of controllers over monitors was replicated using a more
realistic, cognitively complex dynamic task. The similarity of this paradigm to other
dynamic systems, and the convergence of these data with past findings supports the
contention that experience controlling a system (being “in the loop”) provides advantages to
operators when they must passively monitor the system. These findings also suggest that
controlling the system may make monitors more sensitive to system variability, and
especially to those features of the system which were controlled in the past. This strongly
supports concerns by Moray (1986) that there may be serious consequences when operators
learn to monitor a system without ever having controlled the system. Perhaps, in such
learning environments, the relationships between system variables are simply not
understood or appreciated to the same degree as when one must manually control system
variables. This is especially significant, given the suggestion that pilots transitioning into
highly automated aircraft have little opportunity to acquire or practice manual flying skills.

Signaled failure detection performance was expected to be unaffected by the experimental manipulations except for the Controllers in the throttle Visible condition. The data,
however, showed that Controllers across the Visibility condition were significantly faster
than monitors, with the most significant difference being in the throttle NotVisible
condition. While this can be interpreted in a manner which does not contradict the
hypothesis, it may be viewed as somewhat problematic for a hypothesis that states that the controller advantage derives from the heightened activation of the operator’s mental model. This would imply that a well-activated mental model not only guides perception to
critical features of that system, but it may also affect perceptual sensitivity to features of the display unrelated to system operation.
This experimental design does not preclude the possibility that controllers and monitors
develop slightly different mental models of the dynamic system, despite every effort made
in training to prevent it. While the controller’s mental model obviously contained an actual
motor-control component, it was believed that both groups would likely develop the same
underlying operational understanding of the system, and thus the same mental model for
use in inferential failure detection. It is possible, however, that the act of controlling in
Session 1, either through a more active learning experience, or by the reinforcing of certain
system-variable relationships resulting from controlling those variables, may have caused
the development of slightly different mental models. While this does not exclude a mental
model activation theory, it does suggest that Controllers may have a more activated, but also somewhat different, mental model.
Although I believe that this experimental design is highly valid for operational
environments in which training departments have the choice of training future system monitors with or without controlling experience, it lacks ecological validity in the current aviation context. All pilots of highly automated aircraft learn to fly manually before transitioning to automated commercial operations, as suggested by Orlady and Wheeler (1989). Young (1969) and
Wickens and Kessel (1979) used a repeated measures design so that all subjects both
controlled and monitored. While this design generated failure detection performance
differences between system controllers and monitors, it was impossible to determine the
degree to which a more consistent internal model of the system contributed to the observed differences.
Given the success of the current experiment in replicating Kessel and Wickens (1982), I
feel that a return to a repeated measures design using this cognitive dynamic task would
offer several distinct advantages for answering additional questions generated by this
experiment. First, a repeated measures design controls for the large between-subjects
variability found both in this experiment and typically in complex vigilance tasks
(Parasuraman, 1986). More importantly, however, it insures that all subjects develop the
same mental model of the system. While this feature was problematic for Wickens and
Kessel (1979), the fact that proprioceptive feedback is not a direct indication of system
failure in the current paradigm makes this a less pervasive problem. Further, a repeated
measures design has more ecological validity, helping to answer the question of whether periodic controlling can benefit monitoring performance in operational settings.
Experiment 3 uses the same dynamic fuel management task as in Experiment 2 but with a
repeated measures design. This design was altered so that all subjects were trained in the
controlling task and given sufficient time to become proficient at the task (two days in
addition to the training session, as in Experiment 2). Subjects then monitored and detected
failures for the next four days except for two trials on either Day 5 or Day 6, in which they again controlled. The subjects’ failure-detection performance for both failure types was then compared for the two trials following the controlling re-introduction to the same two trials after continued monitoring.
As in the previous experiments, it is hypothesized that controlling the system would cause improved monitoring performance through the re-activation of the operator’s mental model. Given the superior ecological validity of this design for aviation operations, the hypothesized improvement of performance has strong implications for the value of periodic controlling in operational settings.
EXPERIMENT 3
Method
The Methods section for Experiment 3 highlights only differences from Experiment 2.
Subjects
Fifteen right-handed male university students were used in the experiment. Students were
paid a base hourly rate for their participation in the experiment. Additionally, subjects were
given the opportunity to earn a higher hourly rate for good performance.
Task
The task for Experiment 3 was the same as that used for Experiment 2 except for the
following change:
A message box was added to the lower left corner of the display informing subjects of the
participatory mode. The message stated either “Automatic control,” or “Manual control,”
and the messages were displayed in different colors to help alert subjects to any change in participatory mode. In the previous experiments, subjects were informed of the participatory mode in the training session prior to that day’s task, so no message system was necessary.
Experimental Design
A completely within-subjects transfer of training design was used to address the large between-subjects variability typically found in complex vigilance tasks (Parasuraman, 1986) and in the previous two experiments. All subjects learned the controlling task while
detecting both failure types, and proceeded to participate in the controlling mode for the
first two days. Subjects then spent the remaining four days in the monitoring mode, except
for the two 12-minute trials in which they were reintroduced to controlling.
Because of the potential confounding effects of trial and day in a within-subjects transfer of
training design, a pilot study for Experiment 3 was conducted to determine the best
sequence of conditions. The pilot study used four subjects who controlled the system and
detected failures for the first two days, then transferred to the monitoring mode on Day
3. Subjects monitored the system and detected both failure types on Days 3 through 9.
Results of the Experiment 3 pilot study for trial effects showed a significant difference
between Trials 1 and 4 for inferred failures [F(1,4) = 9.9, p < .05], and a non-significant
difference in the same direction for signaled failures. There was a marginally significant
difference between Trials 3 and 4 for inferred failures [F(1,4) = 6.6, p < .1], but no
difference in means for signaled failures. There was no Trial by Day interaction, indicating that the observed trial effects were stable across days. Importantly, failure detection
performance was stable in Trials 4 and 5 for both inferred and signaled failures. (See
Figure 7.)
Figure 7. Experiment 3 pilot study: signaled failure RT and inferred failure index across Trials 1 through 5.
The pilot study results for Days revealed the typical trend of an improvement from Day 1 to
Day 2 (both controlling days) for both signaled and inferred failures as seen in Experiment
2. Further, on Day 3 (the first monitoring day), inferred failure detection performance
declined, while signaled performance increased. More importantly, however, both inferred
and signaled failure detection performance are stable by Day 4 and remained that way through Day 7 (see Figure 8).
Figure 8. Experiment 3 pilot study: signaled failure RT and inferred failure index across Days 1 through 9.
Surprisingly, there was an improvement in inferred failure detection performance on Day 8,
without a concurrent improvement for signaled failures. While this initially appears
contrary to the hypothesis that monitors’ performances should decline after continued monitoring, it is likely an artifact of this experimental paradigm. In a true operational environment, operators would seldom see the same failure often enough for repeated exposure to form the basis for system failure diagnosis. But in this experiment, it appears that sensitivity to
one type of inferred system failure may become a factor after long duration interaction with
this system. By the beginning of the eighth day, system monitors had already observed 210
inferential failures, not including the training session on the first day. It is therefore quite
likely that the tremendous exposure to the inferential failures used in this paradigm actually increased subjects’ sensitivity to this type of failure.
Another explanation is that this improvement is due to subjects anticipating the end of the
experiment. However, this explanation is discounted because the increase occurred on Day
8, not Day 9, and there was actually a marginal decrease in performance from Day 8 to Day
9. Further, there was no concurrent performance increase for signaled failure detection on
Day 8.
Experiment 3 Design
Results from the pilot study revealed three significant design considerations for Experiment
3. First, given the stability of Trials 4 and 5 for both signaled and inferred failures, it was
determined that these trials would be the best for the between- and within-day comparisons. This left the first three trials available for the requisite controller re-introduction. Second,
due to the stability in both signaled and inferred failure detection performance on Days 4
through 7, a six day experiment was chosen. This design allowed both sufficient
controlling experience by using Days 1 and 2 as controlling days, and also allowed a stable monitoring baseline before the controller re-introduction. Therefore, subjects controlled on either Day 5 or Day 6 (to counter-balance any potential
effect of Days) on Trials 2 and 3, and then monitored on Trials 4 and 5 (on either Day 5 or
Day 6 depending on the counter-balance). Comparisons were made between Trials 4 and 5
after controller re-introduction to Trials 4 and 5 after continuous monitoring. (See Figure
9.)
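The counterbalanced schedule can be summarized in a short sketch; the labels and function name below are hypothetical, not part of the experiment software.

    def day_schedule(control_day, day):
        """Five 12-minute trials per day. On the counterbalanced re-introduction
        day, Trials 2 and 3 are controlling; all other trials are monitoring."""
        if day == control_day:
            return ["monitor", "control", "control", "monitor", "monitor"]
        return ["monitor"] * 5

    # Half of the subjects controlled on Day 5, the other half on Day 6;
    # Trials 4 and 5 of each day supply the post-control vs. post-monitor
    # comparison.
    for control_day in (5, 6):
        print(control_day, day_schedule(control_day, 5), day_schedule(control_day, 6))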
Failure Rates
Given that the results of the pilot study suggested that continued monitoring performance
might result in an increased sensitivity to the inferential failures in this paradigm (as seen
on Days 8 and 9), Experiment 3 was designed to minimize subjects’ exposure to inferred
failures. In addition to experimental issues, this consideration was warranted by the fact
that lower inferred failure exposure increased external validity. Therefore, although the
Day 1 failure rate remained the same as in Experiment 2 (six signaled/six inferred), the
number of inferred failures experienced by subjects was decreased in the remainder of the
experiment. On Days 2 through 4, subjects were exposed to four inferred failures on one
trial, two inferred failures on two trials, and zero on two trials. Signaled failures remained
constant for all trials to insure that subjects remained focused on the task even when no inferred failures were present. On Days 5 and 6, subjects were exposed to two inferred failures on Trial 1, zero inferred failures on Trials 2 and 3 (the controller re-introduction trials), and six inferred failures on the comparison trials (Trials 4 and 5).
No inferred failures were presented during the re-introduction trials because operational environments would likely not expose operators to specific failures. In addition, having
subjects control the system without failures is a stronger test of the hypothesis that
controlling a system activates their dynamic model of the system, thus making them more sensitive to subsequent failures. Finally, presenting Trials 2 and 3 without inferential failures was consistent with their expectation bias for inferential failures
developed over the previous three days. This avoided any implicit suggestion that Days 5 and 6 were different from the previous days, with the exception of the controlling re-
introduction.
Figure 9. Failure occurrences, Experiment 3 (signaled and inferred failures per trial; each column is a day, each row a trial within that day):

Day 1          Day 2          Day 3          Day 4          Day 5          Day 6
6 sig, 6 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 0 inf
6 sig, 6 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 0 inf   6 sig, 0 inf
6 sig, 6 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 6 inf   6 sig, 6 inf
6 sig, 6 inf   6 sig, 4 inf   6 sig, 4 inf   6 sig, 4 inf   6 sig, 6 inf   6 sig, 6 inf
Training
The training session was 30 minutes at the beginning of Day 1, and was identical to
Experiment 2. A brief message appeared at the beginning of Day 2 informing subjects that they should continue to control the system and detect both failure types.
At the beginning of Day 3, a message appeared explaining that subjects should monitor the
system and detect both failure types. In addition, it was explained that there would be some
trials in which subjects were required to control the system and detect failures. Subjects
were therefore instructed to check the display at the beginning of each trial to see if the
system was in “Automatic” or “Manual” control mode. In addition, subjects were informed
that they would again see different numbers of failures on each trial for the remainder of the
experiment.
Results
Within-subject comparisons for Trial (4 and 5) within Condition (Post-Control [PC] and
Post-Monitor [PM]), and Trial by Condition were made for both signaled failure RT and the
combined RT and inferential failure error rate measure used in Experiments 1 and 2. An
analysis of variance (ANOVA) was used to test for Trial and Condition main effects and
interactions. Only Trials 4 and 5 on Conditions PC and PM were analyzed for differences.
Although the post-control and post-monitor conditions occurred on both Day 5 and Day 6 (counter-balanced across subjects), they are referred to simply as condition Post-Control [PC] and condition Post-Monitor [PM] for purposes of clarity.
raw RT and error rate data for Inferred failures are provided in Appendix D.
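The exact combination rule for the inferential failure measure is given in the Experiment 1 Method and is not reproduced here. Purely for illustration, a standardized RT/error composite of the general kind described could be formed as follows; both the data and the formula below are hypothetical.

    import statistics

    def zscores(xs):
        mean, sd = statistics.mean(xs), statistics.pstdev(xs)
        return [(x - mean) / sd for x in xs]

    # Hypothetical per-subject values; not the experiment's data.
    mean_rt = [8.1, 9.0, 8.6, 7.9]         # seconds to signal an inferred failure
    error_rate = [0.20, 0.35, 0.25, 0.30]  # proportion of inferred failures missed

    # One possible composite (lower is better); the dissertation's actual
    # formula may differ.
    index = [z_rt + z_err for z_rt, z_err in zip(zscores(mean_rt), zscores(error_rate))]
    print(index)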
Signaled Failures
Controller re-introduction appeared to slow signaled failure detection performance (see Figure 12). There was a marginally significant main effect of Condition (failure detection post-controlling [PC] versus post-monitoring [PM]; 846 vs. 719), [F(1,14) = 3.39, p < .1], but no main effect for Trial (Trial 4 vs. Trial 5; 789 vs. 777). Further,
there was no significant Trial by Condition (PC vs. PM) interaction. Planned comparisons
for Trials between and within Conditions (PC vs. PM) for signaled RT yielded a single significant difference. There were no significant differences within Condition PC for Trial (4 vs. 5;
827 vs. 865), nor for Condition PM (750 vs. 688). In addition, there was no significant
difference between Conditions for Trial 4 (827 vs. 750). However, there was a significant
difference between Conditions for Trial 5 (865 vs. 688), [F(1,14) = 4.9, p < .05] as shown
in Figure 11.
Figure 11. Signaled failure detection performance, Trials 4 and 5, Conditions Post-Control and Post-Monitor.
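As a minimal illustration of the within-subjects Trial by Condition analysis described above, the following sketch runs a two-way repeated-measures ANOVA. The data and the tooling are hypothetical; they are not those used in this research.

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Hypothetical long-format data: one signaled-failure RT (ms) per subject,
    # per Trial (4, 5), per Condition (PC, PM); not the experiment's data.
    data = pd.DataFrame({
        "subject": [s for s in range(1, 5) for _ in range(4)],
        "trial": [4, 4, 5, 5] * 4,
        "condition": ["PC", "PM", "PC", "PM"] * 4,
        "rt": [830, 745, 860, 690, 815, 760, 870, 700,
               840, 735, 855, 685, 825, 750, 868, 695],
    })

    # Two-way repeated-measures ANOVA: Trial and Condition main effects
    # plus their interaction.
    result = AnovaRM(data, depvar="rt", subject="subject",
                     within=["trial", "condition"]).fit()
    print(result)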
Inferential Failures
Inferential failure detection followed the predicted pattern, although the differences were only marginally significant (p < .1). There was a marginally significant
main effect for Condition (PC vs. PM; 2.16 vs. 2.23), [F(1,14) = 3.39, p < .1], but no main
effect of Trial (4 vs. 5; 2.21 vs. 2.23). There was also a marginally significant Trial by
Condition interaction [F(1,14) = 3.74, p < .1], as post-controllers improved from Trial 4 to
Trial 5, but subjects’ performance in the post-monitoring condition worsened from Trial 4
to Trial 5.
Planned comparisons were in the directions predicted by the hypothesis. There were no significant differences within
Condition PC comparing Trial 4 versus 5 (2.26 vs. 2.06, p = .13), nor in Condition PM
(2.16 vs. 2.31, p = .2), although the trend suggested by these data is interesting and will be
discussed further. There was no significant difference by Condition (PM and PC) for Trial
4 (2.26 vs. 2.16). However, there was a marginally significant difference by Condition for
Trial 5 (2.06 vs. 2.31), [F(1,14) = 3.21, p < .1] as shown in Figure 12. Because of the
marginally significant results using the combined index, RT and error rate were analyzed
separately for Trial 5. While there was no significant difference for RT (PC vs. PM; 8674
vs. 8839), there was a marginally significant difference for error rate (.2 vs. .29).
Figure 12. Inverse relationship of Inferred and Signaled failure detection performance, Trials 4 and 5, Conditions Post-Control and Post-Monitor.
Because the design was counter-balanced between Days 5 and 6, the possibility existed that
the benefit of controlling on Day 5 would persist beyond the succeeding trials and perhaps into the next day. This effect would thus prevent a controller advantage from
appearing in the data since the comparison trials for Day 5 controllers were Trials 4 and 5
on Day 6. Therefore, a separate analysis was conducted using only Day 6 controllers
(whose comparison trials were from Day 5 and thus untainted by prior controlling).
While these results are potentially confounded by Day, they did support the primary
hypothesis that controlling benefits subsequent monitoring performance, but also the
hypothesis that this benefit may be of considerable duration. There was a significant
difference for condition (PC vs. PM), showing an advantage for post-controllers in the
combined measure for inferential failures (1.8 vs. 2.31), [F(1,7) = 6.08, p < .05].
Additionally, the same comparison for error rate yielded a significant difference, (.21 vs.
.5), [F(1,7) = 7.98, p < .05]. There were no other significant differences. However, the
signaled and inferred RT means were all in the same directions as the data from the
analysis when both groups were used (Day 5 and Day 6 controllers).
Discussion
The results of Experiment 3 support the hypothesis that periodic controlling can improve
subsequent monitoring performance and, importantly, increase the external validity of this line of research. Previous demonstrations of the controller advantage came from subjects who previously monitored and controlled using tracking tasks (Kessel & Wickens, 1982; Young, 1995). The current results, obtained with a cognitively complex dynamic task, strongly suggest that controlling a system makes one more sensitive to dynamic features of the system and thus more sensitive to system failures. While previous designs trained monitors without any controlling experience, in most operational environments in which operators monitor dynamic systems, operators’ training includes manual control of the system.
This validity concern is especially acute in aviation environments where operators only monitor the system after years of controlling the system manually. The design used in Experiment 3, however, demonstrates
that individuals with considerable hands-on manual control experience, then subsequent
monitoring exposure, will benefit from periodic reintroduction to controlling. While the
specific results showing improved inferred failure detection were marginally significant in
Trial 5, this view is supported by Parasuraman, Mouloua, and Molloy (1996), who found
that monitoring performance was superior after a ten minute period in which some of the
previously automated tasks were returned to operator control. Although the anticipated
results were not present in Trial 4, the abrupt transition from controlling to monitoring was likely responsible.
Signaled Failures
The differentiation between signaled and inferred failures was originally developed to separate failures that can be detected without any system understanding from those requiring a current state of system knowledge (see Experiment 1 for details). Since my theory states that controlling should yield a more “activated” mental model for the operator or, more precisely, an activated, current-state-based mental model of the system, it was hypothesized that inferential
failures would be detected more easily when the operator’s mental model was in its dynamic
activation state. However, signaled failures, which required the operator to respond to
simple stimuli, should remain unaffected by mental model activation because effective
analysis of system behavior yields no advantage for the detection of a signaled failure. In
essence, an individual could have no understanding of system operation, yet be perfectly capable of detecting a signaled failure.
Results from Experiment 3 yielded the finding that past controllers, who presumably
benefited from a more activated mental model, were poorer at detecting signaled failures.
The likely explanation is that one aspect of an activated mental model is that subjects spend
more time scanning for vital information on the display (e.g., throttle information in the
periphery of the display), and thus less time focused in the center of the display where the
signaled failures occurred. While this phenomenon was not predicted, it is consistent with
the view that a proficient mental model guides perceptual activity to those features of the system most relevant to the task (Endsley, 1995).

This same signaled/inferred failure detection trade-off occurred in Experiment 1, and led to the Experiment 2 manipulation in which the throttle was removed from the display, eliminating this information on half of the trials. If past controllers’ signaled failure detection disadvantage
was a result of more time spent looking at throttle information, then removal of the throttle
should alter this signaled failure detection deficit. Results from Experiment 2 were surprising in that past-controllers were faster at detecting signaled failures over both conditions (discussed in detail in Discussion 2). However, with past-controllers
there was a significant difference in the signaled failure detection performance between the
throttle Visible and throttle NotVisible conditions, with poorer performance occurring in
the presence of the throttle display. There was no such effect for the past-monitors. This
finding supports the speculation that the presence of the throttle, and consequently the
subject’s attention to it, has a negative effect on signaled failure detection performance.
Further, it supports the contention that the past-controllers paid more attention to the throttle information than did the past-monitors.
Because the relationship between signaled failure detection performance and the post-controlling condition was established in Experiment 2, a prediction was made for this relationship for Experiment 3. It should be noted that this effect would have
little operational significance, since even when significant the response time differences for
signaled failures were relatively small (e.g., 667 ms vs. 604 ms, from Experiment 2). Rather, its significance is theoretical. As predicted, signaled failure detection performance in the post-controlling condition was significantly worse than in the post-monitoring condition, while inferential failure detection performance was better, although only marginally significant, for the post-controllers. I believe that there are three
possible explanations for the apparent trade-off between signaled and inferential failure
detection performances. All likely explanations originate from the central point that
controlling makes a subject spend more time focused on throttle information and less time
focused on the center of the display where signaled failures occur. Because subjects do
focus more attention on the throttle, it is presumed that this information, at least in part, is
The first explanation for this trade-off in inferred and signaled failure detection
performance is based on the fact that subjects must allocate more resources to the throttle
portion of the task while controlling because one of their controlling tasks is throttle
management. This task requires that subjects monitor the throttle display quite diligently
(subjects were told on the first day of the experiment that their bonus would be partially
determined by how well they managed the throttle on controlling trials) and to use the joystick to make throttle adjustments. It is possible that the act of performing the task simply reinforces a scanning pattern which
incorporates the throttle. This explanation implies that when subjects return to the
monitoring task, their scanning behavior incorporates the throttle not because of increased
perceptual sensitivity to system attributes nor because a more activated mental model is guiding perception, but simply because scanning the throttle has become an unconscious habit which, after controlling the system, happens to result in less scan time in
the center of the display and, therefore, poorer signaled detection performance. This
explanation, however, is not supported by the results. If a habit change was responsible for
the effect, then one would expect the strongest effect to occur directly after the controlling
condition, then weaken as subjects adapted to the monitoring task. However, this effect was
only present in Trial 5 of Experiment 3 when it should have been weakening. The
monitoring trial directly after the controlling re-introduction, Trial 4, showed no difference between the conditions. Further, if the scanning differences were habitual rather than cognitively driven, it seems unlikely that
there would be a resulting pay-off in inferential failure detection, although this is more
difficult to verify.
The second explanation for the change in scanning behavior is that the relationship
between throttle activity and overall fuel system behavior is strengthened when subjects are
forced to manipulate the throttle level by hand in the manual mode. This explanation suggests that subjects shift their attention to the portion of the task they perceive as having the greatest pay-off in terms of failure detection. This
shift, however, must be involuntary since subjects are instructed to detect failures to the best
of their ability in every condition. There is, therefore, no valid reason for subjects to
intentionally select a less effective strategy. While this shift may be characterized as a form of complacency (Parasuraman, et al., 1993), it is difficult to understand its origin. Perhaps the forced shift away from the throttle during extended monitoring produces a scan pattern which, unbeknownst to the subjects, has a harmful effect on their inferential failure
detection performance.
The final explanation for this trade-off, and the explanation most consistent with the
hypotheses, is that the reintroduction of controlling both the throttle and fuel pumps has the
effect of strengthening, or re-activating subjects’ mental models of the system. The effect of
this heightened system understanding, and the resultant increased sensitivity to system
operation, is that subjects pay greater attention to throttle activity and benefit from the
information it provides. Further, this explanation is consistent with Endsley’s (1995) view
that a good mental model guides perceptual activity to relevant cues. This view argues that the perceptual
process is generally outside of conscious awareness, and anecdotal evidence from subjects’
post-experiment comments suggests they were unaware that their attention to the throttle had changed.
Inferential Failures
Inferential failure detection performance was in the pattern predicted by the hypotheses, although the differences were only marginally significant. In the post-controlling condition, subjects were better at detecting inferential failures than when they had been
continuously monitoring. While the pattern between Trials 4 and 5 (the two comparison
trials) was not predicted, a marginally significant difference between groups occurred on
the fifth trial. Performance differences between the two conditions on Trial 4 were not
significant, and suggest that there is a transitional period as subjects transfer from a
controlling to a monitoring mode. This is not surprising given the large differences
between the two tasks, but is likely a factor in need of further study before controlling re-introduction is applied operationally.

More importantly, however, I believe the fact that post-controllers improved from Trial 4 to Trial 5, while performance after continued monitoring declined, is itself significant. This is especially true given that the Experiment 3 pilot study data suggest that
subjects’ performances reached a negative asymptote by Trial 4 and remained poor through
Trial 5. This suggests that periodic controller re-introduction may have the effect of
“resetting the clock” on the deterioration of monitoring performance. The effect presumes
that controlling the system is a considerably different task than monitoring the system, as is
the case in this paradigm (while the objective is the same, the subjects’ activities between
the two tasks are quite different). However, in most operational settings, the difference between controlling and monitoring may not be as great.
Further, the “resetting the clock” concept is quite consistent with the theory that controller
re-introduction has the effect of re-activating the operator’s mental model of the system,
thus shifting the state of the operator’s model away from the static state and towards the
dynamic mental model state. If mental models have a state of activation, as proposed in
this theory, then it is likely that there must be some decay of this activation. Viewed
another way, a dynamic mental model can only remain dynamic, and thus provide
perceptual and computational benefits, for a certain period of time after the features of the
task supporting the dynamic activation cease. While this issue is not directly addressed by
this research (other than at a speculative level), it is another element of the theory which
needs further exploration. Systematic exploration of dynamic mental model decay could
provide critical information for the use of periodic controlling to ward off the negative effects of extended monitoring. It is extremely unlikely that commercial aviation would ever return to an exclusive manual
control environment. However, controlling might be used for short periods of time to
produce the desired effect. It is critical, therefore, that the exact duration of the positive effect of controller re-introduction be determined.
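Although no functional form is implied by these data, one speculative way to formalize such decay, purely for illustration, is an exponential with a task-dependent time constant:

    import math

    def activation(t_seconds, a0=1.0, tau=600.0):
        """Speculative exponential decay of mental-model activation after
        controlling ceases; a0 and tau are illustrative, not estimated values."""
        return a0 * math.exp(-t_seconds / tau)

Under this sketch, periodic controller re-introduction amounts to resetting the activation toward a0 before it decays below the level needed to support effective monitoring.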
It is unfortunate that the inferential failure detection differences were only marginally
significant, but it should be noted that the design of this experiment produced an extremely
conservative test of the hypothesis. Young (1995), using a two dimensional tracking task,
found that subjects who controlled the system in the training portion of the transfer-of-
training experiment without exposure to failures did not show improved failure detection
performance during the monitoring portion of the experiment. This led Young (1995) to
conclude that controlling with failures, rather than just controlling, was responsible for the observed transfer advantage. This conclusion implies that, in order to derive benefit from controlling, the specific failures potentially encountered during monitoring must also be experienced while controlling. In Experiment 3, no inferential failures were presented during the controlling re-introduction, both to achieve maximum external validity, and because it would be the most stringent test of the
theory that mental model re-activation was responsible for post-controlling inferential
failure detection performance, and not a result of specific failure type sensitivity.
The amount of time subjects spent controlling during controller re-introduction was
somewhat arbitrary. Trials 2 and 3 were chosen because they avoided the use of Trial 1
(the trial shown to be significantly different from the other trials in the Experiment 3 pilot
study), yet still allowed two trials for comparison at the end of the session. Given the
transitional factors likely affecting the fourth trial, and the fact that some subjects stated in
post-experiment interviews that they were caught off guard by the re-introduction of
controlling, it is likely that some subjects only had solid controlling experience in the third
trial. Each trial lasted 12 minutes, giving subjects 24 minutes or less of controlling
experience depending on how quickly they recognized and adjusted to the change in
participatory mode. Given the relative difficulty of the controlling task, the improvement in
inferential failure detection performance may have been larger if subjects had been afforded
more controlling time. Additionally, given the general downward trend in performance
over trials (as seen in the Experiment 3 pilot study and in Experiment 3 itself), it is likely
that a stronger effect would also have been achieved by increasing the amount of
monitoring time on days five and six when controller re-introduction did not take place.
Another potential factor affecting the results was that subjects did not perform the task at
the same time each day. Subjects were required to participate in the experiment for six consecutive days, but were allowed to participate anytime during the day. While it was originally thought that this would have little effect, several subjects offered comments such as “I sure did a lot better on that experiment in the morning.” While the variance attributed to this factor is unknown, future multi-day experiments should require subjects to participate at the same time each day.
Experiment 3 was a successful test of the hypothesis because: a) signaled failure detection performance worsened after controller re-introduction, b) inferential failure detection performance improved after controller re-introduction, consistent with the mental model hypothesis, and c) both effects were strongest in the fifth trial, suggesting some long-term reversal of the negative effects of continuous monitoring on failure detection. In sum, Experiment 3 demonstrated that periodic controlling within extended monitoring can have positive benefits for monitoring performance using an ecologically
valid design. Further, when Experiment 3 is considered in the context of this research and
other published works on this topic, it becomes increasingly clear that periodic return to
manual control may be one of the best weapons for fighting the negative effects of extended monitoring.
Conclusion
This research adds considerable depth to previous studies in this field showing that manual
control of systems produces better system monitors. It lends support to the notion suggested
by myself and others (Endsley, 1995; Kessel & Wickens, 1982; Parasuraman, et al., 1996)
that the construct of a mental model may be the appropriate mechanistic explanation for the observed controller advantage. While a mental model explanation for psychological phenomena may harbor seemingly excessive complexity, I would argue that in the context of complex cognitive vigilance tasks, it effectively captures the interaction between the operator and system operation, and it may be a complex explanation which best captures this behavior. The use
of both signaled and inferred failures in this paradigm was novel and effective in
differentiating between vigilance decrements and deficits in the level of activation of the
operator’s mental model. To my knowledge, these experiments are the first to use failures
requiring different levels of cognitive processing for their detection in a single complex
vigilance task.
The first objective of this research was to extend findings that past controllers make better monitors to a cognitively complex dynamic task. While Experiment 2 replicated the basic findings that controllers make better monitors, the use of the throttle mechanism as both a separate controlling task and as an important information source provided new insight into the mechanism behind this advantage. It appears that operators who are trained by controlling a system develop a higher level of understanding of, and sensitivity to, the dynamics of the system they control. When those subjects are then placed in a monitoring condition, their more
comprehensive understanding allows them greater acuity for important system behaviors,
and they use that information to effectively detect failures. Not only is this finding
theoretically significant in its own right, but it supports contentions by Moray (1986) that
system monitors must be trained in a manual control mode if they are expected to effectively detect system failures.

The fact that past-controllers appear to scan important features of the display for system information may help explain several aviation incidents in which errors were made on the aircraft’s FMS, yet pilots failed to observe the aircraft’s unintended
behavior. In each of these occurrences, ample evidence was available on the displays, yet
the pilots failed to perceive this information and process it to a level which should have
signaled the existence of a serious problem. In effect, because the pilots had been
monitoring for extended periods of time before these incidents, their perceptual activity was
blinded because system monitoring failed to require perception of these system variables,
and their perceptual cycle became derailed, at least in relation to the primary goal. In
essence, the inactivity of monitoring yielded a weak dynamic execution of their flying
mental model so that access and understanding of subtle system behavior and the
consequent perceptual activity were severely affected. While this is speculation, I believe
that it is the best explanation to date as to why experienced pilots failed to perceive a developing, serious problem.
This finding also supports the contention by Smolensky (1993) that the notion of situational
awareness may be related to certain physiological attributes. In fact, this finding strongly
supports the view that ocular movement may be a strong predictor of one’s situational awareness. High situational awareness likely reflects a highly activated operator’s mental model of a task, and thus highly efficient perceptual activity as the operator updates and integrates information pertaining to the task. I believe that this perceptual activity should have a strong effect on one’s ocular movement, and is likely to be measurable through eye-movement recording.
The purpose of Experiment 3 was to use the ecologically valid task of the first two experiments in a design more representative of commercial aviation operations. The new design allowed all subjects to learn and perfect the task in a
controlling mode, as is the case in the aviation domain. Subjects then monitored for several
days, and after extensive monitoring, they were momentarily re-introduced to the
controlling task. Even this multi-day design compresses time compared to most operational
settings, but it is more realistic than previous research and external validity is increased by
insuring that all subjects are trained in the same hands-on manner. Results from this
experiment showed that even a 24-minute controller re-introduction can have a positive effect on subsequent inferred failure detection performance, while signaled failure detection performance was significantly worse after the controlling re-introduction.
The combination of improved inferred and poorer signaled failure detection performance
implies that even a short period of manual control within an extended period of monitoring
can cause subjects to return to a more effective pattern of scanning, while perceiving and
integrating system information more effectively. Further, controlling seemed to have the
effect of “resetting the clock” so that after a true 50 minutes of system exposure subjects
were performing as if they had just started the task, even though in previous experiments performance had declined steadily by this point in the session.
The results of Experiment 3 are the strongest evidence yet that periodic controller
reintroduction may be the best tool for airlines and other monitoring-intensive operations to
fight detrimental “out-of-the-loop” performance effects (Endsley, 1995). While the various
perspectives on this problem were outlined in the introduction of this dissertation, they
have generated few concrete solutions. The "controlling solution," however, appears not only
to be effective but also easily implemented. In fact, the only cost seems to be the slight
loss in operational efficiency that occurs when human operators take control for a period of
time. Several directions for future work remain. First, these findings should be replicated
in a realistic full mission simulator using commercial pilots. Second, more experimentation
needs to take place regarding the relationship between the length of the controlling period
and the amount of resulting benefit. It seems quite likely that the law of diminishing
returns would apply to
controller re-introduction, but that point cannot be determined without further
experimentation. In the same vein, it is also important to know the rate of decay of the
operator's dynamic execution of the mental model, assuming the pilot hand-flies the aircraft
at the beginning of the mission and only later is relegated to system monitor. This rate is
likely to be highly task dependent, ranging from several minutes to several days. In fact,
it seems likely that the decay varies along two dimensions: one being task complexity, the
other the extent to which the task is motor versus cognitive. While these questions will be
time-consuming to answer, they will certainly yield vital information for operational
practice.
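One speculative way to formalize this two-dimensional decay is sketched below: an
exponential decay of mental-model activation whose rate is assumed to increase with task
complexity and to decrease with the task's motor component. The functional form, the
direction of both effects, and every constant are assumptions awaiting exactly the
experimentation called for above.

    import math

    # Speculative formalization of the proposed two-dimensional decay.
    # Exponential form and all constants are illustrative assumptions,
    # not estimates from the present data.

    def activation(t_minutes, complexity, motor_fraction,
                   base_rate=0.01, k_complexity=0.5, k_motor=2.0):
        """Residual mental-model activation in (0, 1] after t_minutes
        without manual control.

        complexity:     task complexity in [0, 1]; assumed here to speed decay.
        motor_fraction: motor (vs. cognitive) share of the task in [0, 1];
                        motor skill is assumed to persist longer, slowing decay.
        """
        rate = base_rate * (1 + k_complexity * complexity) \
               / (1 + k_motor * motor_fraction)
        return math.exp(-rate * t_minutes)

    # A complex, mostly cognitive task vs. a simple, mostly motor one:
    print(activation(120, complexity=0.9, motor_fraction=0.1))  # decays quickly
    print(activation(120, complexity=0.2, motor_fraction=0.9))  # decays slowly

Fitting such a surface to transfer data from tasks that differ in complexity and in motor
demand would indicate how often controller re-introduction is needed in a given operational
setting.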
My goal with this research has been to extend and refine past results showing the benefits
of controlling, and to guide this line of research in a direction most beneficial for
commercial aviation and other industrial tasks that rely on continuous monitoring. I believe
this line of work shows that while the "out-of-the-loop" performance problem is both real
and serious, potential solutions are available. Further, unlike most solutions to serious
problems, where the benefit only slightly outweighs the cost, this "controlling solution"
appears to offer substantial benefit at very little cost.
REFERENCES
Adams, M. J., Tenney, Y. J., & Pew, R. W. (1995). Situation awareness and
the cognitive management of complex systems. Human Factors, 37(1), 85-104.
Comstock, J. R., & Arnegard, R. J. (1992). The multi-attribute task battery for
human operator workload and strategic behavior research (Tech. Memorandum 104174).
Hampton, VA: NASA Langley Research Center.
Confusion over flight mode may have role in A320 crash. (1992, Feb. 3).
Aviation Week & Space Technology, p.29.
Covey, R. R., Mascetti, G. J., Roessler, W. U., & Bowles, R. (1979, December).
Operational energy conservation strategies. Proceedings of the Institute of Electrical and
Electronic Engineers Conference on Decision and Control. Ft. Lauderdale.
Crash triggers review of AMR. (1996, January 1). Aviation Week & Space
Technology, p.30.
Indian A320 crash probe data show crew improperly configured aircraft. (1990,
June 25). Aviation Week & Space Technology, p.84.
Jagacinski, R. J., & Miller, R. A. (1978). Describing the human operator's internal
model of a dynamic system. Human Factors, 20, 425-439.
Johannsen, G., Pfendler, C., & Stein, W. (1976). Human performance and
workload in simulated landing approaches with autopilot-failures. In T. B. Sheridan and
G. Johannsen (Eds.), Monitoring and Supervisory Control. New York: Plenum.
Kieras, D., & Bovair, S. (1984). The role of a mental model in learning to operate
a device. Cognitive Science, 8, 255-273.
Reprinted in H. Sinaiko (Ed.), Selected papers on human factors in the design and use of
control systems. New York: Dover Publications, Inc., 1960.
Norman, S., Billings, C. E., Nagel, D., Palmer, E., Wiener, E. L., & Woods, D. D.
(1988). Aircraft automation philosophy: A source document. Flight deck automation:
Promises and realities, [Workshop manual]. NASA Ames Research Center: Moffett Field.
Parasuraman, R., Mouloua, M., & Molloy, R. (1996). Effects of adaptive task
allocation on monitoring of automated systems. Human Factors, 38(4), 665-679.
Parasuraman, R., Molloy, R., & Singh, I. L. (1993). Performance consequences of
automation-induced "complacency." International Journal of Aviation Psychology, 3(1),
1-23.
Sarter, N. B., & Woods, D. D. (1995). How in the world did we ever get into that
mode? Mode error and awareness in supervisory control. Human Factors, 37(1), 5-19.
Sarter, N. B., & Woods, D. D. (1992). Pilot interaction with cockpit automation:
Operational experiences with the flight management system. The International Journal of
Aviation Psychology, 2(1), 303-322.
Sarter, N. B., & Woods, D. D. (1991). Situation awareness: A critical but ill-
defined phenomenon. The International Journal of Aviation Psychology, 1(1), 45-57.
Sekigawa, E., & Mecham, M. (1996, July 29). Pilots, A300 systems cited in
Nagoya crash. Aviation Week & Space Technology, 36-37.
Thackray, R. I., & Touchstone, R. M. (1989). Detection efficiency on an air
traffic control monitoring task with and without computer aiding. Aviation, Space and
Environmental Medicine, 60, 744-748.
Van Cott, H. P., Wiener, E. L., Wickens, C. D., Blackman, H. S., & Sheridan, T.
B. (1996, October). Smart automation enhances safety: A motion for debate. Ergonomics
in Design, 4(4), 19-23.
Wickens, C. D., & Kessel, C. (1979). The effects of participatory mode and task
workload on the detection of dynamic system failures. IEEE Transactions on Systems,
Man, and Cybernetics, SMC-9(1), 24-34.
Wiener, E. L. (1993). Life in the second decade of the glass cockpit. Proceedings
of the Seventh International Symposium on Aviation Psychology, 1-11.
Wiener, E. L. (1985). Cockpit automation: In need of a philosophy (SAE Tech.
paper 851956). Washington, D.C.
Wiener, E. L., & Curry, R. E. (1980). Flight deck automation: Promises and
problems. Ergonomics, 23(10), 995-1011.
Williams, M. D., Hollan, J. D., & Stevens, A. L. (1983). Human reasoning about
a simple physical system. In D. Gentner & A. Stevens (Eds.), Mental models (pp. 131-
153). Hillsdale: Erlbaum.
APPENDIX A: Experimental task.
APPENDIX B: Experiment 1 Inferred failure RT and error rate.
                         Session 1            Session 2
                         Day 1                Day 2 (monitoring)
Controllers              Control              Auto-pilot
                                              Yoked
"Auto-pilot" monitors    Auto-pilot           Auto-pilot
                                              Yoked
"Yoked" monitors         Yoked                Auto-pilot
                                              Yoked
Experiment 1 participatory modes.
                         Session 1            Session 2
                         Day 1                Day 2 (monitoring)
Controllers              .51/8171             Auto-pilot: .53/8963
                                              Yoked: .49/8660
"Auto-pilot" monitors    .77/10506            Auto-pilot: .7/8116
                                              Yoked: .62/10401
"Yoked" monitors         .74/8389             Auto-pilot: .76/8911
                                              Yoked: .58/9026
Experiment 1 inferred failure detection performance, error rate and reaction times
(error rate/RT).
APPENDIX C: Experiment 2 Inferred failure RT and error rate.
                         Session 1                      Session 2
                         Day 1        Day 2             Day 3 (monitoring)
Controllers              Control      Control           Throt Vis
                                                        Throt NotVis
"Yoked" monitors         Monitor      Monitor           Throt Vis
                                                        Throt NotVis
Experiment 2 participatory modes.
                         Session 1                      Session 2
                         Day 1        Day 2             Day 3 (monitoring)
Controllers              .37/8467     .26/8310          Throt Vis: .21/7315
                                                        Throt NotVis: .24/7398
"Yoked" monitors         .44/8401     .33/8098          Throt Vis: .3/7627
                                                        Throt NotVis: .25/7845
Experiment 2 inferred failure detection performance, error rate and reaction times
(error rate/RT).
APPENDIX D: Experiment 3 Inferred failure RT and error rate.
Experiment 3 Inferred failure detection performance, error rate and reaction times. (Error
rate/RT).