
THE TRANSFER OF INFERRED VS.

SIGNALED FAILURE DETECTION


PERFORMANCE IN MONITORS AND CONTROLLERS
OF A COMPLEX DYNAMIC TASK
BY
GRANT E. YOUNG

A Dissertation submitted to the Graduate School in


partial fulfillment of the requirements
for the Degree
Doctor of Philosophy

Major Subject: Psychology


New Mexico State University
Las Cruces, New Mexico

August 1997

VITA

November 29, 1965 -- Born in Oakland, California


1994 -- MA, New Mexico State University, Las Cruces, New Mexico
1988 -- BS, Denison University, Granville, Ohio
1989 - 1992 -- Hewlett-Packard Co., Palo Alto, California

PROFESSIONAL SOCIETIES

Human Factors and Ergonomics Society, National Chapter


Human Factors and Ergonomics Society, N.M.S.U. Student Chapter, President, 1994-95

PUBLICATIONS

Young, G. E. (1995). The Impact of Trial Length and Mode Experience on Failure-
Detection Performance in Monitored and Controlled Dynamic Tasks. Proceedings of
the Eighth International Symposium on Aviation Psychology, 1031-1036.

Berringer, D. B., Allen, R. C., Kozak, K. A., & Young, G. E. (1993). Responses of Pilots
and Non-pilots to Color-coded Altitude Information in a Cockpit Display of Traffic
Information. Proceedings of the Human Factors and Ergonomics Society 37th Annual
Meeting, 84-87.

FIELD OF STUDY

Major Field: Experimental Psychology


Engineering and Aviation Psychology, Human-Computer Interaction

ABSTRACT

THE TRANSFER OF INFERRED VS. SIGNALED FAILURE DETECTION


PERFORMANCE IN MONITORS AND CONTROLLERS
OF A COMPLEX DYNAMIC TASK
BY
GRANT E. YOUNG

Doctor of Philosophy in Psychology


New Mexico State University
Las Cruces, New Mexico, 1997
Dr. James E. McDonald, Chair

Previous research has shown that active controllers can detect failures in a simple dynamic
system faster and more accurately than passive monitors. Further, when controllers transfer
to a monitoring task, they also have better failure detection performance than subjects who
only monitor. This dissertation has two objectives: (a) to replicate previous tracking-task
based findings using a new, cognitively complex dynamic task with failure types which tap
into different cognitive processes, and (b) to use this new task paradigm in an ecologically
valid experimental design to further explore the demonstrated advantages of controlling.
Further, this dissertation advances the contention that the controller/monitor issue should
be conceptualized as a difference in the level of activation of the operator’s mental model of
the system. Results from Experiment 1 fail to replicate past findings of a controller
advantage, but yield the surprising result that past controllers may scan the display more
effectively. Experiment 2 improves upon the basic design of Experiment 1 and makes it
possible to explore the issue of controller versus monitor scan differences in greater depth.
Experiment 2 successfully replicates the controller advantage observed in tracking-task experiments and supports the conclusion of Experiment 1 that controllers scan the display
more effectively and use the information gained to their advantage. Experiment 3 uses the
same experimental paradigm, but in a design more representative of operational settings.
All subjects in Experiment 3 learned in a controlling mode and then transferred to the

monitoring task. However, subjects were periodically reintroduced to the controlling mode
and its effects on their subsequent monitoring performance were measured. Results
demonstrate that controller reintroduction has a positive effect on monitoring performance.
Implications of these findings for operational environments are discussed in detail.

TABLE OF CONTENTS

LIST OF FIGURES.......................................................................................................... viii

INTRODUCTION ............................................................................................................... 1

Pros and Cons of Automation ....................................................................... 3

Advantages of Automation............................................................................ 4

Problems with Cockpit Automation............................................................... 8

The Role of the Vigilance Decrement .......................................................... 10

Peripheralisation ......................................................................................... 13

Loss of Motor Skills ................................................................................... 15

Reduction of Small Errors at the Cost of Occasional Large Errors............... 16

Are Multi-modal FMSs too Complex?......................................................... 18

Is Workload Lower, or Just Different? ........................................................ 25

Automation Induced Complacency.............................................................. 27

Situational Awareness................................................................................. 30

Mental Models............................................................................................ 33

Relevant Research....................................................................................... 39

Paradigm History ........................................................................................ 40

Present Research......................................................................................... 44

Experimental Hypotheses, Experiment 1 ..................................................... 47

EXPERIMENT 1............................................................................................................... 49

Method ....................................................................................................... 49
Subjects ........................................................................................................................ 49

Apparatus ..................................................................................................................... 49
Task ............................................................................................................................. 49
Experimental Design .................................................................................................... 52
Training ....................................................................................................................... 53

Results........................................................................................................ 54
Signaled Failures .......................................................................................................... 56
Inferential Failures ....................................................................................................... 57

Discussion .................................................................................................. 58

Experiment 1 Conclusions........................................................................... 64

Implications for Experiment 2 ..................................................................... 65

Experimental Hypotheses, Experiment 2 ..................................................... 68

EXPERIMENT 2............................................................................................................... 69

Method ....................................................................................................... 69
Subjects ........................................................................................................................ 69
Task ............................................................................................................................. 69
Experimental Design .................................................................................................... 70
Training ....................................................................................................................... 71

Results........................................................................................................ 71
Signaled Failures, Session 1.......................................................................................... 72
Signaled Failures, Session 2.......................................................................................... 72
Inferential Failures, Session 1....................................................................................... 73
Inferential Failures, Session 2....................................................................................... 73

Discussion .................................................................................................. 74

Conclusions and Implications ..................................................................... 80

Experiment 3 Experimental Hypotheses ...................................................... 83

EXPERIMENT 3............................................................................................................... 85

Method ....................................................................................................... 85

Subjects ........................................................................................................................ 85
Task ............................................................................................................................. 85
Experimental Design Considerations ............................................................................ 85
Experiment 3 Design .................................................................................................... 88
Training ....................................................................................................................... 91

Results........................................................................................................ 92
Signaled Failures .......................................................................................................... 92
Inferential Failures ....................................................................................................... 93
Day 6 Controllers - Separate Analysis........................................................................... 94

Discussion .................................................................................................. 95

Signaled Failures......................................................................................... 96

Inferential Failures .................................................................................... 100

Conclusion................................................................................................ 104

REFERENCES ............................................................................................................... 109

Appendix A: Experimental task...................................................................................... 116

Appendix B: Experiment 1 Inferred failure RT and error rate........................................ 117

Appendix C: Experiment 2 Inferred failure RT and error rate. ....................................... 118

Appendix D: Experiment 3 Inferred failure RT and error rate. ....................................... 119

LIST OF FIGURES

Figure 1. Experimental design, Experiment 1. Session 2 counterbalanced


by condition.................................................................................... 53

Figure 2. Experiment 1, Signaled Failure RT. ....................................... 56

Figure 3. Session 2 Inferential Failure detection performance (combined


index). ............................................................................................ 58

Figure 4. Experimental design, Experiment 2, Session 2 counterbalanced


by condition.................................................................................... 71

Figure 5. Session 2 Signaled Failure RT. .............................................. 73

Figure 6. Session 2 Inferential Failure detection performance (combined


index). ............................................................................................ 74

Figure 7. Inferred and Signaled Failures by Trials, Days 4 - 7. .............. 87

Figure 8. Inferred and Signaled Failures, Days 1 - 9.............................. 87

Figure 9. Experiment 3 experimental design, participatory mode, counter-


balanced on Days 5 and 6. Comparison trials outlined in bold......... 89

Figure 10. Experiment 3 experimental design, failure occurrences by


failure type, randomized by subject on Days 2 - 4. Comparison trials
outlined in bold............................................................................... 91

Figure 11. Signaled failure detection performance, Trials 4 and 5,


Conditions Post-Control (PC) and Post-Monitor (PM). .................. 93

Figure 12. Inverse relationship of Inferred and Signaled failure detection


performance, Trials 5, Conditions Post-Control and Post-Monitor. . 94

I know I’m not in the loop, but I’m not exactly out of the loop. It’s more
like I’m flying alongside the loop.
-Anonymous Boeing 767 Captain (Wiener, 1988)

INTRODUCTION

The interaction of pilots and highly automated aircraft has become an increasingly studied

topic in recent years. This interest has been fueled not by the pursuit of academic

enlightenment, but rather by a series of fatal aircraft accidents in which the pilot/auto-pilot

interaction was the primary cause (Billings, 1991). I believe that this precarious state of affairs originated when industry, consumers, and safety advocates alike embraced a promising technology. While some skeptics warned that the pilot/auto-pilot relationship might not perform at the harmonious level anticipated, this perspective was overwhelmed by the many advantages promised by automation. In

fact, despite the concerns over cockpit automation, most would agree that an aircraft with a

high level of automation is more efficient, and possibly even safer than a similar aircraft

without it (Billings, 1991). Although accidents per passenger mile continue on a downward

trend, numerous examples of a new pilot-automation interaction problem are showing up in

the probable cause section of accident reports. The problem, however, is not that

automation exists in modern aircraft, but rather that a lack of foresight has yielded systems

and procedures intolerant of certain predictable, yet unavoidable errors.

Automation, when used in the aviation domain, is a broad term. It does not refer to a single

device, but rather a class of devices which control the various dynamic processes in an

aircraft ranging from basic mechanical systems to the actual task of “flying” the aircraft.

For purposes of clarity, when “automation” is used in this paper, it refers to the definition

used by Billings (1991): “A system in which many of the processes of production are

automatically performed or controlled by self-operating machines, electronic devices, etc.”

As will be discussed in greater detail, the level of automation in aircraft has been creeping

across the automation continuum for the last 80 years, and has only in the last several

decades become so prominent in the cockpit that it has raised serious concerns.

The question of whether or not to automate civilian transport and military aircraft of all

types is now merely academic (Billings, 1991). Prior to the 1950s the question was “what

can we automate?” The evolution of microelectronics and microprocessors quickly changed

the question to “what should we automate first?” By the early seventies, such questions

were rarely asked as virtually every component of the cockpit had become, or was on its

way to becoming, highly automated. It has only been since the late 1980s that the question

of “what and how much” to automate has again become an important and serious issue

in the design of commercial aircraft. Longtime proponents of automation presently

acknowledge that automation must be more ”intelligent” if it is to achieve the levels of

safety initially anticipated (Van Cott, Wiener, Wickens, Blackman, & Sheridan, 1996),

while skeptics of automation continue to believe that multi-modal automation may simply

be too complex to be managed by human pilots (Sarter & Woods, 1993).

While the future of automation seems to be progressing toward increased levels of “intelligence” combined with clearer operational features, a simpler pilot-automation interface, and appropriate “human centered” operational procedures, fundamental arguments for and against different implementations of automation remain and will be discussed in detail in the following section. Following that discussion, the various

experimental paradigms and research perspectives which have focused both directly and

indirectly on the issue of cockpit automation will be discussed in detail. These include

vigilance research, peripheralisation, motor-skill factors, the small error/large error trade-

off, automation reliability, flight management system mode complexity, workload levels,

automation induced complacency, situational awareness, and the role of mental models as a

framework for conceptualizing problems with automation. Finally, specific past research

relevant to the present research perspective will be discussed, along with a discussion of

how the present research complements other research in this field.

Pros and Cons of Automation

One need only briefly review the history of both commercial and military aviation to

appreciate the desire of many to reduce the level of human operator control in the cockpit.

Although difficult to quantify, best estimates put the direct contribution of human error to

commercial aviation disasters at approximately 70% (Nagel, 1988), while the role of some

human error as a contributing factor in the chain of events leading to an accident is likely

even higher. This of course does not include fatal mishaps on railroads, ships, automobiles,

and industrial applications, but the percentages are likely similar (Van Cott, Wiener,

Wickens, Blackman, & Sheridan, 1996). Although human error is a complex concept, it

can generally be broken down into the following categories (Woodson, 1981):

1) Perceptual errors: Searching for and receiving information, and identifying objects,

actions and events.

2) Mediational errors: Failure to process information, solve problems or make decisions

correctly.

3) Communication errors: Failure of communication between crew members, crew to Air

Traffic Control, and trainers and manufacturers to crew members.

4) Motor errors: Failure to execute simple and complex, discrete and continuous, motor

behaviors correctly.

Interestingly, automation has both addressed and aggravated human error in each of these categories. The purpose of the following

discussion is to explore in detail both the pros and cons of cockpit automation, and the

complex way that it interacts with human error. In fact, this discussion will highlight the

observation that technology has the potential to solve problems while at the same time

introducing new problems.

Advantages of Automation

Numerous studies of human error in the cockpit have concluded that a primary, though not exclusive, cause of human error is excessive workload (Kantowitz & Sorkin, 1983).

Prior to the use of automation in the cockpit, pilots were forced to attend to and manage the

many complex systems in the aircraft (e.g., fuel distribution, engine management, cabin

pressurization, etc.) and fly the aircraft (e.g., manual control, navigation, and

communication). While the first generation of automation included two dimensional

aircraft control and simplified radio navigation, the second generation of aircraft

automation included the consolidation of displays into integrated displays, the transition

from raw data into more usable command information (e.g., a flight director), and the use

of air data computers to integrate multiple sources of information regarding air density and

direction into usable information both for the pilot and auto-pilot (Billings, 1991). Today,

the third generation of automation sees all complex systems in the aircraft partially or

completely automated (Satchell, 1993), with an emphasis on the integrated management of

all the automated devices on the aircraft (Billings, 1991). If they choose, pilots need only

be involved in commanding the automation through the Flight Management Computer,

becoming intimately involved in aircraft control only if a problem or unusual circumstance

occurs. It is for this reason that the majority of aircraft produced today are designed for two

pilots, as compared to four (two pilots, plus a flight engineer and navigator) which was the

case only thirty years ago. In fact, few would argue that the application of automation in

systems management has not been a tremendous success.

Because second generation sub-system management was relatively mundane and

straightforward, the automation was relatively simple and its execution largely error free.

Although such automation eventually eliminated the need for a flight engineer altogether, it

did relatively little to ease the workload of the flying pilots since much of the system’s

automation replaced the flight engineer’s duties, but not necessarily the pilots’. Although

flying an aircraft under cruise condition requires relatively low levels of workload, getting

the aircraft from the ground to cruise, and then from cruise to the ground requires

considerable effort on the part of the pilots (Billings, 1991). In addition, a statistical

breakdown of aircraft accidents demonstrates convincingly that these portions of the flight

contain the greatest risk. In fact, 90% of all accidents occur in the climb to, or descent from

cruising condition (Nagel, 1988). This statistic is made more profound by the fact that

these two phases of flight account for less than 40% of the flight time (Nagel, 1988). In

fact, accidents during cruise account for less than 9% of all aircraft accidents, yet cruise

flight accounts for 60% of flight time. Not only must pilots communicate with air traffic

control (ATC) and navigate to the correct location, but they must maintain control of the

aircraft in the desired attitude, altitude, vector, and velocity. Although this task is not

difficult for the seasoned pilot, it nonetheless requires considerable attention to be

accomplished effectively. Workload requirements can be further increased in this phase of

flight due to the presence of hostile weather, aircraft traffic and frequent ATC

requirements, or by systems problems (Nagel, 1988).
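
To make the exposure-adjusted risk concrete, the short sketch below (Python, purely illustrative) divides each flight phase's share of accidents by its share of flight time, using the approximate percentages cited above from Nagel (1988).

```python
# Illustrative only: exposure-adjusted accident risk by flight phase,
# using the approximate percentages cited from Nagel (1988).

phases = {
    # phase: (share of accidents, share of flight time)
    "climb/descent": (0.90, 0.40),
    "cruise":        (0.09, 0.60),
}

for phase, (accident_share, time_share) in phases.items():
    relative_risk = accident_share / time_share  # accident share per unit of exposure
    print(f"{phase:14s} relative risk = {relative_risk:.2f}")

# climb/descent: 0.90 / 0.40 = 2.25
# cruise:        0.09 / 0.60 = 0.15
# Per hour of exposure, climb and descent carry roughly 15 times the accident
# risk of cruise flight under these figures.
```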

Given the complexity of the flying task in these phases of flight, the increased risk of

mishap, and the strong correlation between pilot workload and pilot error (Kantowitz &

Sorkin, 1983), tremendous effort was put into the design of Flight Management Systems

(FMS) which, through computerized command of navigation and aircraft control, ease the

burden of controlling the aircraft in these critical phases of flight. Such automation, in

theory, eliminates many time- and resource-consuming tasks which contribute to pilot

workload. Advocates of automation emphasized that by reducing a pilot's workload,

attention could be directed to monitoring mission progress and overall system status, rather

than burdening a pilot's cognitive resources with command and control processing.

Further, should a partial or complete failure occur with the automation, the pilot could

quickly and effectively diagnose and re-engage at the point where the auto-pilot

relinquished authority.

If one were to tour the cockpit of a modern airliner, one would find a Flight Management

System (FMS) which not only has the capacity to successfully control and navigate the

aircraft through descent and ascent, but can fly the aircraft from takeoff to taxi at the

destination without a single pilot intervention of the aircraft controls (Billings, 1991). In

fact, until recently, it had been the operational policy of air carriers to encourage their pilots

to use their FMS to its fullest capacity, leaving the pilots with the duty of high level

management of the flight (Billings, 1991).

The other compelling reason for the development of the FMS, besides the belief that the human's limited workload capacity was the primary barrier to aircraft safety, was the acknowledgment that human inner-loop control precision is very limited (Billings, 1991).

Not only is the task of precise control tedious and perceptually demanding, but the high

control error levels of human operators mean considerable loss in efficiency. In fact,

microprocessor control of flight allows all flight phases and transitions to be accomplished

at maximal efficiency. Not only do human pilots lack the specific knowledge of how to

execute control maneuvers with perfect efficiency but, even with this knowledge, their

control accuracy is inadequate. The use of Flight Management Systems has thus introduced

far greater efficiency to the aircraft system, regardless of changes in aerodynamics or

engine efficiency. In fact, Covey et al. (1979) suggested that a 12% savings in fuel

consumption could be achieved by optimizing operational efficiency (not including physical

changes to aircraft systems) with much of this gain coming through the use of automation.

Another study cited by Wiener and Curry (1980) suggested that a three percent reduction in

fuel consumption could result in a 26% increase in airline profits. Fueling this drive

toward efficiency was also the fact that the price of a gallon of jet fuel went from 38 cents in

1978 to 70 cents in 1979 (Wiener & Curry, 1980), and went above a dollar in the 1980s

where it remains today. Improved efficiency clearly increases the profitability of airlines,

increases the need for new aircraft, lowers ticket prices, and reduces environmental impact.

Further, by lowering the cost of flying to the general public, overall transportation safety is

theoretically enhanced by moving people into air travel and away from more dangerous

forms of personal transportation.

The final reason for the push towards automation was the need to address specific human

error induced safety concerns such as controlled flight into terrain and air-to-air collisions.

In accidents such as these, it was often clear that sufficient information was present so that

given prompt and accurate interpretation of the information, such disasters could be

avoided (Billings, 1991). Computerized systems were thus developed to deal effectively

with these specific safety hazards.

A good example of this automation is the Ground Proximity Warning System mandated by

Congress in 1975 to address a series of “controlled flight into terrain” incidents (Wiener &

Curry, 1981). This simple form of automation combines radar and barometric altimetry to

calculate height above ground and rate of change, therefore predicting when a possible

unintended conflict with the ground might occur (Billings, 1991). Such automation is

advisory only, thus leaving ultimate command authority to the pilot. Other examples of

“problem specific” automation include devices which force the control column of an aircraft

forward (known as a “stick pusher”) to avert an aerodynamic “stall,” and the Traffic Alert and Collision Avoidance System (TCAS), which receives transponder signals from other

aircraft and displays them in relation to one’s own aircraft thus warning of potential

conflict. There are many other such systems in modern aircraft and most would agree that

current “problem specific” automation has been quite successful, despite the common

appearance of problems when these systems are first instituted (Billings, 1991).
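
As a rough illustration of the alerting logic described above, the sketch below estimates time to terrain contact from radio altitude and its rate of change and raises an advisory when that time falls below a threshold. The function names and the 30-second threshold are hypothetical simplifications for illustration; actual GPWS equipment uses several mode-specific alerting envelopes.

```python
# A deliberately simplified sketch of the GPWS idea described above:
# combine height above ground (radio altitude) with its rate of change
# to predict an unintended terrain conflict. Real GPWS logic uses several
# mode-specific alerting envelopes; the 30-second threshold is illustrative.

def time_to_terrain(radio_altitude_ft, altitude_rate_fpm):
    """Return estimated seconds until ground contact, or None if climbing or level."""
    if altitude_rate_fpm >= 0:
        return None  # not descending toward terrain
    return radio_altitude_ft / (-altitude_rate_fpm / 60.0)

def gpws_advisory(radio_altitude_ft, altitude_rate_fpm, warning_time_s=30.0):
    t = time_to_terrain(radio_altitude_ft, altitude_rate_fpm)
    return t is not None and t < warning_time_s

# Example: 1,200 ft above ground while descending at 3,300 ft/min gives
# roughly 22 seconds to impact, so the advisory fires.
print(gpws_advisory(1200, -3300))   # True
print(gpws_advisory(1200, -700))    # False (roughly 103 seconds remain)
```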

Problems with Cockpit Automation

Although the previous section highlighted the positive aspects of increased automation, this

evolution has generated considerable controversy and contributed to numerous disasters. Some highly

visible incidents have illuminated the fact that the transition to automation has not been

problem free. I believe that careful analysis of the problems with aircraft automation show

it to be a combination of traditional human performance problems seen in other man-

machine systems, combined with other unpredicted problems discovered through the

8
analysis of accidents and incidents, data from the Aircraft Safety Reporting System (ASRS)

and simulator studies. The following is a list of those factors:

1. Given ample evidence of the poor monitoring ability of humans, can pilots be trusted to

monitor complex systems for long durations with adequate vigilance?

2. Does automation cause “peripheralisation” as a result of removing pilots from the

control loop, thus reducing their effectiveness as human operators?

3. Does automation cause a degradation of motor skills which will impair pilots when, by

choice or by force, they are required to re-enter the control loop?

4. Does automation eliminate frequent small human errors but give way to infrequent

serious errors?

5. Can automation be designed so that it is as reliable as advocates predict, and intelligent

enough to circumvent typical “dumb” computer errors?

6. Can human pilots effectively program and monitor the “multi-modal” Flight

Management Systems which have many programmable flight modes, and can switch modes

based on internal factors and not pilot input?

7. Does automation really reduce pilot workload, or has the workload remained the same

but transferred from “flying the plane” to “programming the computer?”

8. Does the reliability of automated systems cause complacency in the cockpit which has an

adverse effect on the efficiency of humans as monitors?

9. Does continuous monitoring by pilots cause a loss of “situation awareness” which could

adversely affect their monitoring performance?

10. Do pilots have “mental models” of the flying task which may be adversely affected by

“flying the automation” rather than flying the aircraft, and thus preventing effective system

monitoring?

These potential problems with automation are clearly not independent, and any theorized or observed problem with the pilot-automation interface is likely causally related

to several of these factors. The following section discusses each of these factors in greater

detail.

The Role of the Vigilance Decrement

The ability of individuals to maintain effective sustained attention to real-time processes was first studied in earnest during World War II (Parasuraman, 1986), although some

concern can be traced to early questions about inspectors’ abilities to detect assembly line

defects (Wiener, 1984). The advent of radar produced the need for human operators to

monitor this new technology and efficiently detect enemy threats in a highly monotonous

task with few signals. Through a series of field experiments on both sides of the Atlantic, it

quickly became clear that the fragile nature of human monitoring performance meant that

normal working schedules were inappropriate for sustained attention tasks (Wiener, 1984).

In fact, early field studies by the RAF Coastal Command suggested that radar monitoring

performance declined after approximately 30 minutes (Parasuraman, 1986), while a team

from the United States suggested that monitoring periods not exceed 40 minutes (Wiener,

1984). Research led by Mackworth (1950) verified in the laboratory what the military had observed in the field and coined the term “vigilance decrement” to describe

this phenomenon.

The vigilance decrement referred to the fact that after a given period of sustained attention,

human operators lost their ability to discriminate signals that they could otherwise detect (Parasuraman, 1986). Although the causes of the vigilance decrement are

complex and still debated, its existence in simple sustained attention tasks is unquestioned.

Fortunately, remedies for this problem were unusually simple. Once the onset of the

vigilance decrement was established, a work shift routine was designed so that no operator

was monitoring past the period of full vigilance, and each operator was given enough of a

break so that full vigilance was restored.

The success in early vigilance research was due in large part to the fact that the actual task

of observing radar displays lent itself well to the design of experimental tasks for laboratory

research (Parasuraman, 1986), although a minority criticized the early research for its

artificially high signal rates (Wiener, 1984). This convenient and uncommon

circumstance, combined with parallel findings in field research, meant that the early

vigilance research was widely accepted and did not suffer the typical validity issues

associated with laboratory research. This was not the case, however, when researchers tried

to apply vigilance research findings to other, usually more complex, sustained attention

tasks. Not only was the majority of vigilance research conducted in the laboratory focused

on extremely simple, low arousal tasks, but when more complex paradigms were used, the

results were often equally complex (Parasuraman, 1986).

Early vigilance research in which more complex paradigms were used sometimes found

little or no vigilance decrement, leading many to view the vigilance decrement as a

laboratory phenomenon not applicable to complex real world tasks (Parasuraman, 1986).

One predominant view of complex task vigilance was based on research conducted by

Adams et al. (1961). This research used a simulated air defense task in which the number

of non-signal targets was either 6 or 36. Although overall detection performance was worse

when the non-signal targets were more abundant, performance did not change with

increases in time spent at the task. This finding led Adams et al. (1962) and others (as

cited in Parasuraman, 1986) to believe that complex tasks yielded sufficient arousal to

prevent a vigilance decrement. Still other studies, however, demonstrated a strong

vigilance decrement. These studies include a three-clock version of the Mackworth clock

test and multi-channel auditory monitoring tasks (Parasuraman, 1986).

Several theories have been offered for the disparity of results in complex task vigilance

research. Most importantly, the tasks and procedures vary widely and thus make it difficult

to classify tasks along a continuum of “task complexity” (Parasuraman, 1986). Further,

Adams et al. (1961) suggested that because of large individual differences in complex task

performance, slight vigilance decrements may exist, but fail to reach statistical significance

in laboratory settings. Supporting this, Parasuraman (1976) found that between-subject

variability in detection rate for a dual-source visual discrimination task was nearly twice

that for a single source task. Another explanation contends that since complex task

performance is already poor, there is little opportunity for it to get worse with time (Davies

& Tune, 1969). Additionally, it has been proposed that when a complex-task vigilance

decrement exists, it may be only a slight decrement in sensitivity, thus having only slight

effects on detection rate (Parasuraman, 1986).
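
To illustrate why a small loss of sensitivity need have only a small effect on detection rate, the sketch below assumes an equal-variance Gaussian signal-detection model with a fixed response criterion; the particular d' and criterion values are arbitrary assumptions chosen for illustration, not values reported by Parasuraman (1986).

```python
# Equal-variance Gaussian signal-detection sketch (illustrative assumptions only):
# a modest drop in sensitivity d' produces only a modest drop in hit rate
# when the response criterion stays fixed.
from scipy.stats import norm

def hit_and_fa_rates(d_prime, criterion):
    """Hit and false-alarm rates for a criterion measured from the noise mean."""
    hit_rate = 1 - norm.cdf(criterion - d_prime)
    fa_rate = 1 - norm.cdf(criterion)
    return hit_rate, fa_rate

criterion = 1.5                    # a conservative observer (assumed value)
for d_prime in (2.0, 1.8):         # before vs. after a slight vigilance decrement
    hits, fas = hit_and_fa_rates(d_prime, criterion)
    print(f"d' = {d_prime:.1f}: hit rate = {hits:.2f}, false-alarm rate = {fas:.2f}")

# d' = 2.0: hit rate ~ 0.69; d' = 1.8: hit rate ~ 0.62 -- a 10% loss of
# sensitivity costs only about seven percentage points of detections here.
```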

Only a brief review of complex task vigilance research is necessary to appreciate the

historical difficulty in finding a consistent and robust vigilance decrement in complex tasks.

Although some may see this as evidence of a weak, if not irrelevant, phenomenon,

others have used these diverse findings to support the contention that a vigilance decrement

does exist in complex tasks, but it is difficult to capture experimentally (Parasuraman,

1986). Regardless, however, there is little doubt that the diversity of complex tasks in the

operational environment makes it extremely difficult to design research tasks ecologically

valid enough to generalize findings to the operational environment. Thus, it has been

proposed that any theoretical findings from laboratory settings must be verified in parallel, highly realistic paradigms to ensure ecological validity (Satchell, 1993). While an

undertaking such as this may be unrealistic, it would likely quell some of the debate over

the existence of the vigilance decrement in complex tasks.

Peripheralisation

The term peripheralisation has been used to describe the process of role change that pilots

experience as they become increasingly distanced from the essential flight process as levels

of automation increase (Billings, 1991; Norman, Billings, Nagel, Palmer, Wiener, &

Woods, 1988; Satchell, 1993). Satchell (1993) has described peripheralisation as a

“complex psycho-biological state which occurs as a consequence of automation.” This

peripheralisation process stems partially from the failure of aircraft designers to focus on

human needs in an “out of the loop” control environment (Wiener & Curry, 1980) but it

may also be an inescapable consequence of the automation process.

Satchell (1993) has organized the effects of peripheralisation into the following three

categories, some of which will be discussed in greater detail elsewhere in this paper:

1. Complacency: A “self-satisfaction which may result in non-vigilance based on an

unjustified assumption of satisfactory system state.” Key in the notion of complacency is

the consistency and reliability of automation, both of which have been shown to affect

monitoring performance (Parasuraman, 1993).

-Primary/secondary task inversion: A behavioral phenomenon in which a backup alerting

system, for example an altitude alerting system, becomes the primary information source for

the operators. Such task inversions usually result in altered operator monitoring behavior.

-Automation Deficit: The temporary and relative reduction in manual performance upon

resuming a task which has been previously automated.

-Boredom-panic syndrome: The behavioral phenomenon in which continuous monitoring

of automation leads to boredom, which then leaves the operator ill-equipped to deal with a suddenly increased workload. An example of this is the high

workload levels often encountered below 10,000 ft. by cockpit crews following extended

monitoring during cruise flight.

2. Communication: Research into aircraft accidents has generated many examples that

show a strong relationship between peripheralisation and ineffective communication.

Flight crews who communicate effectively have been shown to communicate more

frequently, openly, directly and concisely compared to ineffective crews. However, studies

comparing crews in aircraft with different automation levels show that as level of

automation goes up, quality of crew communication goes down.

3. Situational awareness: “the accurate perception of the factors and conditions that affect

an aircraft and its flight crew during a defined period of time.”

-The Big Picture: Although related to situational awareness, the big picture refers to

awareness of the state of the system at a global level. An example of this would be the

China Air Lines crew who let their 747 stall and enter a spin while they attended to an

engine problem.

-Information acquisition: Automation can adversely affect information acquisition since

system automation often translates, interprets, and integrates raw data before presenting it on the pilot/system interface. Although this translation of raw data is often

advantageous, it can have some peripheralising consequences.

The concept of peripheralisation is quite useful as a construct in understanding the

changing role of human operators as automation increases. If a system designer’s goal is to

lower mental workload by reducing the amount of raw data received by the pilot, then some

peripheralisation may be acceptable. If, however, the automation of navigation

peripheralizes the crew to the degree that they no longer attend to the navigation of the

aircraft, a devastating result may occur should the automation fail or be misdirected by the

pilots.

Loss of Motor Skills

The loss of motor skill as a result of lack of practice is a major concern accompanying

increased automation in the cockpit (Endsley, 1995). Not only is the problem salient and

fairly well studied, but it has been a frequently reported concern of pilots of automated

aircraft (Hughes, 1989; Wiener & Curry, 1980). Further, Moray (1986) has emphasized the

need for operators of automatic systems to have extensive manual practice even though it

will seldom be used in actual operation. Interestingly, however, recent accidents involving

automation issues have not shown manual skill proficiency to be a primary concern, since

accidents have generally occurred when the auto-pilot was flying, or the automation and

pilot were “fighting” for control of the aircraft, thus interfering with each other’s relative

control commands (Aviation Week, 1996). This is not to say that degradation of skill is not

a problem, but rather that other automation factors seem to be more causally related to

aircraft mishaps.

It is ironic, however, that proponents of automation have long argued that much of the

value of automation resides in the fact that pilots, when required, can easily intervene and

pick up where the auto-pilot left off. In reality there is considerable evidence that persistent

monitoring of automation leads to some loss of sensitivity to the subtle dynamic

relationships between system variables (Kessel & Wickens, 1982; Shiff, 1983; Wiener &

Curry, 1980). It is common for co-pilots, when transferring from highly automated wide-

body aircraft to narrow body, less automated aircraft, to need a transition period to revive

their proficiency in manual control skills (Wiener & Curry, 1980). Further complicating

this issue is the fact that with the introduction of highly sophisticated FMSs, complementary

changes in airline procedures discourage manual flight (Billings, 1991). Rather than being evaluated on manual flying skill, pilots are judged on their effective use of the

vastly capable and complex “integrated flight path and aircraft management systems” in the

cockpit (Billings, 1991). It should also be noted that several recent incidents, for example

the crash of a USAir 737 in Pittsburgh, although as yet unsolved, may have been related to

a wake turbulence induced unusual attitude which became catastrophic when attitude

recovery failed, possibly due to a lack of manual proficiency.

Reduction of Small Errors at the Cost of Occasional Large Errors

The introduction of technology into society has created the interesting phenomenon of

reducing small errors of precision, at the cost of occasionally introducing very serious large

errors. Consider the frequently cited example of the digital alarm clock. The introduction

of this device meant that the accepted 10 to 15 minute precision error of the analog alarm

clock was now eliminated (Wickens, 1992). However, this technology meant that the

occasional “set up error” (Wiener & Curry, 1980) could yield an error, although infrequent,

of 12 hours, nearly 48 times the magnitude of the analog clock's error. The same potential

for occasional catastrophic errors exists in the automated cockpit (Wiener, 1988). Consider

the following example:

January 20th, 1992, Strasbourg, France:

An Airbus A320 flew into the ground while on a non-precision approach to

Strasbourg-Entzheim Airport in France. A post-crash analysis of flight data

determined that the aircraft was descending at a rate of 3,300 ft/min during its

pre-crash descent, far steeper than the 700 ft/min required by the approach

(Aviation Week, 1992). However, the VOR/DME approach chart for that

airport required a 3.3 degree angle of descent, which is what the pilots most

likely intended to input into the Flight Management System. Rather than

entering "3.3" into the "Track/Flight Path Angle" descent mode, the mode

"Heading/Vertical Speed" was inadvertently selected. This "mode error" by the

pilots caused the aircraft to descend at 3,300 ft/min, rather than the intended

3.3 degree angle of descent. The A320 crashed short of the airfield in

mountainous terrain killing 87 of the 96 passengers aboard. Taped crew

conversations indicate that the pilots never realized an error had been made.

This example, besides being an example of a “mode error” which will be discussed in detail

later in the paper, is a perfect example of the small-error/large-error tradeoff. Although it

would be difficult for human pilots to fly a perfect 3.3 degree angle of descent, it is unlikely

that in attempting to do so, they would err by such a large magnitude. The automated system, however, can fly the aircraft at a 3.3 degree angle of descent nearly free of error, but must

be commanded to do precisely that.
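
A minimal sketch of the magnitude of that mode error: the same “3.3” entry implies roughly an 800 ft/min descent when interpreted as a flight path angle at a typical approach ground speed, but a 3,300 ft/min descent when interpreted as a vertical speed. The 140-knot ground speed is an assumed value used only for illustration.

```python
# Illustrative only: the same "3.3" entry interpreted in two descent modes.
# A ground speed of 140 knots is an assumed, typical approach value.
import math

KT_TO_FT_PER_MIN = 6076.12 / 60.0   # knots (nautical miles per hour) -> feet per minute

def descent_rate_from_angle(angle_deg, ground_speed_kt):
    """Vertical speed (ft/min) implied by a flight path angle at a given ground speed."""
    return math.tan(math.radians(angle_deg)) * ground_speed_kt * KT_TO_FT_PER_MIN

entry = 3.3
print(round(descent_rate_from_angle(entry, 140)))  # ~817 ft/min in a flight-path-angle mode
print(entry * 1000)                                 # 3300.0 ft/min in a vertical-speed mode
```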

There are abundant examples of such errors being made. However, most are detected

before the error is elevated to a catastrophic level (Billings, 1991; Wiener, 1988). Wiener

(1988) has suggested four approaches to dealing specifically with this problem. First, he suggests that systems need to be less accepting of erroneous input at the human interface level,

rather than depending on training and correct operation to alleviate the symptoms of bad

design. Second, he suggests that systems be designed to be less vulnerable to unsafe actions

even in the event of erroneous input (for example, Ground Proximity Warning Systems).

Third, systems must be designed with error checking, or a certain “intelligence” capability

to deal with the logic of inputs given other relevant factors (for example, comparing pilot’s

altitude inputs with an internal terrain map). Finally, Wiener (1988) suggests that the

entire system, including Air Traffic Control, be designed to be less tolerant to overall

system error (e.g., insuring that aircraft follow the exact instructions provided by ATC).

None of the suggestions raised by Wiener (1988) are necessarily easy to implement, but

given the catastrophic outcomes of the “large error” problem in the modern cockpit, many

changes are being made in the direction of these ideals. Although GPWS and TCAS are

now mandatory forms of error checking (Van Cott, et al., 1996), improved FMS user

interfaces and terrain checking are in the process of being perfected. Further, this issue

highlights the importance of “smart” automation as advocated by some individuals in the

field (Van Cott, et al., 1996). Although the assumption is that smart automation would

detect and resolve large errors, ample evidence suggests that this is a complex and difficult

undertaking.

Are Multi-modal FMSs too Complex?

Early auto-pilots were simple devices which could turn to and hold a heading, climb to and

hold an altitude, or track a navigation signal for the purpose of decreasing the need for

continuous hands-on control of the aircraft (Billings, 1991). More important was the fact

that every behavior of the auto-pilot had to be specifically commanded by the pilot;

commands to the auto-pilot were never more than one flight transition away from the

current flight condition (e.g., the pilot could command the auto-pilot to turn to a specific

heading and hold that heading, but could not at that time input a future heading change

command). As automation transitioned into its second generation in the late 1950s

(Billings, 1991), automatic control of the aircraft became gradually more sophisticated,

with devices becoming autonomous from continuous pilot command. Examples of such

devices are the yaw damper, which automatically initiates slight rudder movement to

prevent the “Dutch roll” phenomenon in swept wing aircraft, and “pitch trim

compensators” which control the tendency for aircraft to pitch down at near-supersonic

speeds. Although these devices and others like them increased safety and efficiency, and in

some cases, made high speed transport a reality, they also set the precedent that

autonomous automation could be introduced into the cockpit safely and successfully without

undue concern for human factors.

As automation transitioned into its third generation (Billings, 1991), the objective of

integrating and managing the automatic systems to further reduce workload, increase

safety, and increase efficiency led to FMSs with phenomenal capability. Not only could an

entire flight be preprogrammed into the system, but this execution of the flight could be

undertaken without pilot intervention. Intrinsic to this automation capability was that the

system would have many “modes” available to command the flight. Just as pilots have

several methods to accomplish the same task in an aircraft (e.g., on approach, one can

control altitude with power changes, pitch changes, or approach with the throttles at idle,

and control altitude with pitch and wing spoilers), so too were multiple capabilities built

into FMSs, both for efficiency, and to provide the pilots with greater flexibility. With this

increased ability, however, came a certain need for the automation to deal with peculiar

situations without pilot intervention. This meant that upon reaching certain predetermined

target values or reaching certain “protection limits” (i.e., the system senses that an unsafe condition has occurred), the FMS can change its “mode” of operation or override pilot

inputs.

Automation with this level of sophistication has led to two specific pilot interaction

problems (Sarter & Woods, 1991). The first problem is that, given the inherent complexity

of the system, greater demands are placed on the pilots to understand the multiple

ramifications of each FMS “mode.” Because a particular mode may behave differently

under different circumstances (e.g., at different altitudes), the pilot must understand in

advance what the FMS will do given certain inputs, and remember which mode annunciation the FMS will display at any given moment. Consider the following example:

February 14th, 1990, Bangalore, India:

An Airbus A320 flew into the ground while on short approach to Bangalore

Airport in India. The pilots had inadvertently set the auto-pilot to "Idle Open

Descent" mode, which sets the auto throttle to idle, rather than one of the two

descent modes in which auto throttles are active (Aviation Week, 1990).

Consequently, unbeknownst to the pilots, the aircraft slowed to a speed of 25

knots below the desired airspeed of 132 knots since altitude was maintained by

pitch rather than thrust. By the time the pilots realized their error, they were too

slow and too low to recover, and crashed short of the runway killing 94 of the 146

people on board.

The second FMS complexity problem is that the behavior of the automation is contingent

upon certain “situational” factors in addition to pilot inputs, often making it difficult for the

pilots to predict the behavior of the auto-pilot either upon engaging the auto-pilot, or in

monitoring its behavior as it progresses along the flight (Sarter & Woods, 1992). Consider

the following example:

April 26th, 1994, Nagoya, Japan:

An Airbus A300-600 stalled 1800 feet above the ground on approach to Nagoya

Airport, Japan, following a chaotic battle for control of the aircraft between the

pilots and the auto-pilot. While flying the aircraft manually with flight director

guidance and auto-throttles engaged, the co-pilot inadvertently engaged the TOGA

(take off/go around) lever on the throttle quadrant. Realizing the error, the captain

correctly instructed the co-pilot to disengage the auto-throttle. In attempting to

correct for the now off-glideslope condition, the pilots engaged the auto-pilots 1 and

2, believing that the auto-pilot would return them to the desired flight path.

Instead, the auto-pilot resumed the TOGA mode which had accidentally been

selected by the co-pilot previously. Realizing this, the pilots applied forward

pressure on the yoke to correct for the auto-pilot induced 18 degree nose up

condition. However, because the FMS software inhibits automatic “yoke force auto-

pilot disengagement” below 1500 ft, the auto-pilot remained engaged and initiated

movement of the “trimmable horizontal stabilizer” in the opposite direction. While

the pilots pushed down with all their strength, the trim system continued to push the

nose upward for twenty seconds until the pilots manually disengaged the auto-pilot.

Several seconds later, the extreme nose up condition and deteriorating airspeed

unexpectedly caused the “alpha floor” protection mode to engage due to excessive

angle of attack. This “alpha floor” condition commanded a thrust increase inducing

an even greater nose-up attitude. Although the captain promptly disengaged the

“alpha floor,” the aircraft was far out of trim, the airspeed was at 78kts, and the

altitude 1800 ft. The aircraft stalled and could not be recovered before it hit the

ground, killing 264 people. (Aviation Week and Space Technology, 1996)

Although this incident seems obscure and hardly believable, very similar incidents also

occurred in 1985, 1989, and 1991 (Aviation Week, 1996), and highlight the dangers of

highly capable, yet unintentionally complex auto-pilot systems. Interestingly, however,

such incidents point to an intriguing paradox of automation. As auto-pilots become more sophisticated, they in fact begin to fly more like humans (albeit more precisely), using a

complex combination of methods to achieve their goal. However, as this occurs, it makes it

more difficult for those pilots monitoring the automation to predict and, in fact, understand

what the automation is doing. Given the flexibility of the FMS and the “dynamism of flight

path control,” serious cognitive demands are placed on the pilots (Sarter & Woods, 1992).

Not only must they decide the level and mode of automatic control, but they must diligently

track its behavior in a highly dynamic environment.

Sarter and Woods (1992, 1994), while seeking empirical evidence for pilots’ anecdotal

suggestions of confusion about the Flight Management System operation, found converging

and complementary data demonstrating both serious gaps in pilots’ understanding of the

system logic and difficulty in tracking the behavior of the FMS while in flight.

Surprisingly, Sarter and Woods (1992) found that 55% of Boeing 757 pilots surveyed

agreed with the statement “In B-757 automation, there are still things that happen that

surprise me.” Further, 20% of the pilots agreed with the statement: “There are still modes

and features of the B-757 FMS that I don’t understand.”

In a follow-up study, Sarter and Woods (1994) created an FMS command-laden

experimental scenario which was then flown in a part-task simulator designed to teach FMS

operations. The goal of this study was to observe pilots using the FMS in a simulated

mission to understand pilots’ mental representations of the FMS logic. The results showed

that the majority of pilots had little difficulty with routine operations ranging from

establishing a holding pattern to setting up for an ILS approach. However, they found that

70% of pilots showed deficiencies in one or more of the following less standard procedures:

1. aborting take-off with auto-throttles engaged,

2. anticipating mode indications on the ADI display throughout the take-off roll,

3. anticipating the arming of the go-around mode,

4. disengaging Approach mode after signal “capture,”

5. explaining speed management,

6. defining end-of-descent point for different modes,

7. describing the system behavior differences above and below 1500 ft for a loss of radio

“signal” condition.

An example deficiency was that 80% of pilots did not realize that aborting an auto-throttle

take-off required the pilot to manually disconnect (as opposed to an automatic disconnect)

the auto-throttles in order to prevent them from re-accelerating after manual intervention.

The authors (Sarter & Woods, 1994) attribute these deficiencies to two separate factors.

They see the first three deficiencies related to weak mode awareness, both in terms of

dealing with an FMS related failure and with anticipating system status and behavior. The

second factor, raised by the last four deficiencies, points to an impoverished knowledge of

the “functional structure” of the FMS (Sarter & Woods, 1994). It is quite obvious from

these findings (Sarter & Woods, 1992, 1994) and others (Wiener, 1989) that even

experienced pilots have trouble with the complexity of the FMS. Sarter and Woods (1992)

suggest that one of the primary problems with FMS systems is the poor feedback given to

pilots about the behavior of the FMS, exacerbating the already difficult task of predicting

system behavior. In fact, both accidents and empirical investigations have led to

considerable FMS changes (Aviation Week, 1995). However, continuing investigations reveal that pilots still have trouble understanding and predicting FMS behavior,

despite improved feedback, interfaces, and training.

It has also been suggested that any system which is “multi-modal” in nature is

difficult for human operators (Norman, 1988; Wiener, 1989), and thus problematic

regardless of interface and training issues. Further, the problem remains that even

if a crew has complete understanding of the FMS, 84% of FMS-related reports to the ASRS indicate that “programming errors” remain the largest problem area (Aviation Week, 1995). The problem of detecting programming errors is compounded by the fact that pilots have trouble predicting the behavior of the FMS. Clearly, if some element of “wait and see” is built into pilots’ monitoring behavior, detection of programming errors may be delayed while pilots attempt to internally justify the FMS behavior. Consider the following

example:

December 20, 1995, near Cali, Colombia:

A Boeing 757 crashed into San Jose Mountain on approach to Cali, Colombia, while

performing a “Ground Proximity Warning System (GPWS) escape maneuver” to avoid

the mountain. A post-crash analysis of flight data and cockpit recordings determined

that the pilots of the aircraft entered a command into the FMS to fly direct to Tulua

VOR in order to comply with an ATC request to report their position once over the

VOR. However, the pilots failed to realize that they had already passed Tulua VOR, so

their command caused the FMS to turn the aircraft back in the direction from which

they had come. While the behavior of the aircraft surprised the pilots, they continued

to let the FMS turn the aircraft in the wrong direction for approximately 90 seconds.

With suspicion growing, the pilots switched the auto-pilot to the “heading select” mode

in an attempt to return the aircraft heading toward Cali. However, the 90 second turn

to the left, and then the corrective turn to the right placed the aircraft off course and in

a valley surrounded by high mountains. The pilots attempted a maximum performance

climb after prompting by the GPWS, but the 757 hit the top of a 12,000 ft mountain

killing 164 of the 167 individuals on board. (Aviation Week and Space Technology,

1996)

Only as evidence builds that the autopilot behavior is deviating from expectation will the

pilots begin to suspect a programming error. Although industry has advocated better

training and researchers have advocated better FMS interface and cockpit display design, it

seems likely that mode confusion will persist as long as FMS operation is optimized for

efficiency, and not simplicity.

Is Workload Lower, or Just Different?

As suggested earlier, the clear relationship between high workload and increased

probability for human error has been a strong force in the push toward cockpit automation.

Early systems automation was successful in reducing the workload for pilots (Billings, 1991),

which was welcomed given that pilots “prefer to be relieved of much of the routine manual

control and mental computation in order to have time to supervise the flight more

effectively and to perform optimally in an emergency” (Wiener, 1988). Further, airlines

have long desired wide-body aircraft requiring only two crew members, and the reduction of

workload in the cockpit was a prerequisite for such designs.

Defining workload and then measuring it has always been a difficult task for engineering

psychologists (Wiener, 1985), yet aircraft designers were quite confident that heightened

levels of automation would reduce workload to a large degree (Wiener, 1988). Two

interesting factors have arisen since highly automated aircraft were certified for two pilot

operation on the grounds that workload had been sufficiently reduced. First, it seems quite

evident from pilot studies that, while manual workload may have been effectively reduced,

mental workload was not reduced and may have actually increased. This is because the

automation management task is now part of pilot duties (Wiener, 1988).

Wiener (1988) suggests that automation now calls for more programming, planning,

sequencing, and alternative selection, all of which add up to considerable levels of cognitive

processing. In fact, a study by Curry (1984) of 100 Boeing 767 pilots found that only 47%

agreed with the statement, “automation reduces overall workload.” Responding to another

question, 53% of the pilots agreed with the statement, “Automation does not reduce

workload, since there is more to monitor now.” Although these reports are subjective in nature, it is clear

that many pilots find the automation management task quite demanding and perhaps more

demanding than the highly manual flying task which the automation replaced (Kantowitz

& Casper, 1988). Not only does this bring into question the validity of the certification

findings, but it implies that if the relationship between workload and error still exists in the

automated cockpit, automation may now be affording new opportunities for human error

based solely on increased workload levels.

Another factor related to automation induced workload is the temporal spacing of workload

throughout the different phases of the flight. Automation has reduced the workload in

cruise phases of flight to almost nothing (Billings, 1991). However, workload tends to

increase dramatically upon entering the “terminal” area because of two factors. First, since

terminal area flight usually requires some combination of directional, altitude, and speed

changes, not to mention potential FMS mode changes, the monitoring task becomes much

more involved. The pilots must monitor the aircraft’s behavior in an attempt to stay ahead

of the FMS, and they must also monitor the FMS commands to ensure that the information

programmed into the FMS at the beginning of the flight is correct. The second factor in

increased terminal area workload is the commonly cited mismatch between Air Traffic

Control procedures and the FMS (Wiener, 1985). Assuming that ATC requires deviation

from a standard approach, which is often the case, the pilots must spend “heads down” time

reprogramming the FMS, while maintaining constant communication with ATC and

scanning for other aircraft. In the future ATC may communicate directly with an aircraft’s

FMS, thus reducing both communication and programming errors and allowing the pilots

greater opportunity to scan for other aircraft, but this feature is still several years

away.

Automation Induced Complacency

Although traditionally a somewhat ill-defined concept, complacency is often mentioned as a

potential negative effect of automation as related to monitoring performance (Parasuraman,

Molloy, & Singh, 1993; Thackray & Touchstone, 1989). In addition, the term complacency

has been used to describe inadequate cockpit performance prior to the advent of highly automated

cockpits. Wiener (1981) has defined complacency as “a psychological state characterized

by a low index of suspicion,” while the ASRS coding manual defines complacency as “self-

satisfaction which may result in non-vigilance based on an unjustified assumption of

satisfactory system state” (Parasuraman et al., 1993). Singh, Molloy, and Parasuraman (1993), in the development of a complacency rating scale, viewed complacent behavior as

one’s attitude toward automation coexistent with other factors. Singh et al. (1992) found

four independent factors revealing a potential for complacency, those being confidence,

reliance, trust, and safety related complacency.

Overconfidence in automation may not, however, be a strong enough factor itself to cause

complacency. Although Thackray and Touchstone (1989) attempted to induce the effects

of complacency by having a reliable automated Air Traffic Control task fail both in the

beginning and end of a two hour experimental session, they failed to show a reliable

performance difference between the two failures. Further, their research did not yield a

difference in detection efficiency between the group with automated assistance and the

group who performed the task without assistance. Thackray and Touchstone (1989)

reasoned that their failure to find a difference may have been due to the short session, or

perhaps because the subjects performed only a monitoring task, with no other tasks

competing for resources. Parasuraman et al. (1993) reasoned that the effects of automation-

induced complacency are more likely when the operator is responsible for many functions,

as is often the case in aircraft incidents in which complacency was a factor. Singh et al.

(1993) reasoned that complacent behavior exists only when both a complacency potential

exists on the part of the pilots, and certain other factors coexist. Those factors include pilot

inexperience, fatigue, high workload, and poor communication (Singh, Molloy, &

Parasuraman, 1993).

Based on the reasoning that high workload may cause automation induced complacency,

Parasuraman et al. (1993) had subjects detect failures of an automated system monitoring

device while those subjects controlled a fuel management system and a tracking task.

Automation reliability was manipulated: groups saw either constant (high or low) reliability automation or variable-reliability automation that alternated between high and low every ten minutes.

Additionally, subjects were placed in a “monitor only” group or were in a group which

monitored and controlled all tasks. Results clearly showed that detection of automation

failures was worse for subjects in the constant reliability condition. Results also showed

that subjects whose only task was to monitor showed no performance differences due to

automation reliability. This finding supported earlier findings (Thackray & Touchstone,

1989) that workload must reach a certain level before complacency related performance

deficits will be seen. The authors viewed these results as the first evidence that automation

induced complacency could be produced by high workload and highly reliable automation.

These findings are significant in the operational setting because workload can be very high

at certain times and the automation extremely reliable. The problem, however, is that the

automation is not perfectly reliable. As discussed earlier in this paper, pilots often enter

incorrect information into the FMS which then diligently carries out exactly what it is

commanded to do. In addition, the automation is capable of failure even when the correct

information is entered into the system. Consider the following example related to this author by a 767 captain:

While in cruise over the Mediterranean en route from London to Cairo, the pilots

of a Boeing 767 monitored as the FMS flew the aircraft. Unbeknownst to the

pilots, the auto-throttle was gradually but erroneously reducing the thrust being

supplied from the engines. While this was occurring, the auto-pilot was

responding to the reduced thrust by pitching the aircraft up ever so slowly to

maintain the altitude specified by the FMS. Because of the moderate rate of thrust

reduction and smoothness with which the auto-pilot responded, the pilots failed to

sense the cues normally associated with changes in pitch. Fortunately, the captain

eventually noticed that the airspeed was unusually low, and manually accelerated

the throttles. However, by the time the anomaly was noticed, the airspeed had

dropped to 25 kts below the appropriate cruise speed, and to only 15 kts above stall

speed.

The problem with automation-induced complacency is that such complacency is unjustified in operational settings. Not only does the automation itself sometimes fail to perform correctly, but programming errors and pilots’ misunderstandings of the FMS create an environment in which complacency is unwarranted.

Situational Awareness

As the issue of out-of-the-loop performance has become increasingly important (Endsley,

1995), new terminology, research methods and constructs have evolved to suit this research

area. Of these, the concept of “situational awareness” has evolved as a means of both

conceptualizing the problem, and, in fact, measuring it. The use of situational awareness as

a causal agent is strongly supported by some (Endsley, 1995) or used only as a label for a

variety of cognitive processing activities by others (Sarter & Woods, 1995). It is viewed as a

“buzzword of the ‘90s,” rather than an effective research paradigm (Wiener, 1993), and

viewed as an obstacle to research, rather than as a phenomenon description, by still others

(Flach, 1995). Because of the effort dedicated to this research paradigm in both civilian

and military settings, situational awareness will be treated as a valid construct for the purposes of this paper, and the discussion will focus on how it has been used to conceptualize and measure the

“out of the loop” performance problem.

Although there have been numerous definitions proposed for situational awareness, most

have not been applicable across different task domains (Endsley, 1988). However, the

definition settled on by the most prolific researcher in this area is as follows (Endsley,

1995): “Situational awareness is the perception of the elements in the environment within a

volume of time and space, the comprehension of their meaning, and the projection of their

status in the near future.” Further, Endsley (1995) has divided situational awareness into

three hierarchical levels. Level 1 situational awareness is described as the perception of

elements in the environment. These task-specific elements include the status, attributes,

and dynamics of the environment which are specifically pertinent to effective performance.

Level 2 situational awareness is the comprehension of the situation based on the synthesis

of disjointed Level 1 elements. Most important, however, is the fact that this level of

comprehension includes an understanding of the significance of the perceptual elements in

light of the operator’s goals, providing a holistic picture of the environment to the operator.

Level 3 situational awareness is the ability of the operator to project the future actions of the

elements in the environment based on level 2 situational awareness. The highest level of

situational awareness “is achieved through knowledge of the status and dynamics of the

elements and comprehension of the situation,” (Endsley, 1995).

Although relatively little research has been conducted using situational awareness as the

dependent variable, Endsley and Kiris (1995) used an expert-system-aided navigation task

to study the effects of differing levels of automation on workload and situational awareness.

Using five levels of automation, the authors hypothesized that both workload and

situational awareness would decrease as the level of automation increased. Measuring a)

decision time upon automation (expert system) failure, b) decision selection, c) decision

confidence, d) workload, and e) situational awareness, the authors found that as the

automation level went up, decision time following an automation failure also went up.

Further, situational awareness also went down as the level of automation went up,

confirming the authors’ hypothesis for situational awareness. Interestingly, however, only Level 2 situational

awareness was affected by automation, leading the authors to speculate that subjects who

relied on automation may not have developed a higher level of understanding of the

situation. Significantly, workload levels were unaffected by automation, mirroring other

research and anecdotal findings (Billings, 1991) that automation does not necessarily

correlate with reduced workload. Surprisingly, higher confidence levels corresponded with

higher levels of automation, even though their decision times were longer and situational

awareness lower.

Whether or not one supports the use of situational awareness as a theoretical construct, or

as merely a general descriptive concept, there is little doubt that the research that has been

done has successfully captured some difference in operator knowledge based on automation

level. Further, in terms of conceptualizing and communicating the nature of the “out of the

loop” performance problem, research in this area has been beneficial. Most importantly,

however, if one looks at this research as part of a body of research which has attempted to

measure operator performance in terms of level of automation, the findings are generally

consistent with other research in the field demonstrating reduced operator efficiency when

placed “out of the loop” (Johannsen, Pfendler, & Stein, 1976; Kessel & Wickens, 1982;

Wickens & Kessel, 1979-1981; G. Young, 1995; Young, 1969).

Mental Models

The concept of the “mental model” as an explanatory device for human cognition is not a

new one, nor is interest in mental models exclusive to cognitive and engineering

psychology (Wilson & Rutherford, 1989). In fact, mental models have been used as an

explanatory construct in manual control literature for over thirty years (Rouse & Morris,

1986). This body of literature commonly used the phrase “internal model” to describe the

“images” that individuals use to organize and execute daily procedural activities or to

operate complex devices (Jagacinski & Miller, 1978). While the originator of the mental

model notion is likely Kenneth Craik (1943), Johnson-Laird (1983) instantiated and

popularized the notion of the mental model (and in fact the more sophisticated and

relationally complex conceptual model) as a legitimate construct for cognitive psychology.

Johnson-Laird’s (1983) conceptualizations of mental models gave rise to a more open

embrace of this concept by cognitive psychology in the early eighties (Rouse & Morris,

1986). Interestingly, however, while the manual control community viewed this concept as generally self-evident (Rouse & Morris, 1986) and therefore a suitable assumption which

allowed calculations of expected control performance, the cognitive community focused

more directly on the “mental model” as a phenomenon (Rouse & Morris, 1986), even

though Johnson-Laird’s largely functional approach avoided issues of fundamental mental

processes (Wilson & Rutherford, 1989). Norman (1988) explained people’s interactions with and understanding of devices by distinguishing between the conceptual model, characterized as the appropriate model which a system designer desires the operator to have, and the mental model, which is what the operator actually develops through device

interaction.

Even though the use of the concept of a mental model is fairly common in the literature, it

has suffered from a lack of explicit definition (Rouse & Morris, 1986). Johnson-Laird

(1981) stated “A [mental] model represents a state of affairs and accordingly its structure

[which] plays a direct representational or analogical role. Its structure mirrors the relevant

aspects of the corresponding state of affairs in the world.” Rouse and Morris (1985) have

defined a mental model as: “mechanisms whereby humans are able to generate descriptions

of system purpose and form, explanations of system functioning and observed system states,

and prediction of future states.” Carroll and Olson (1987) have defined mental models as

“a rich and elaborate structure, reflecting the user’s understanding of what the system

contains, how it works, and why it works that way. It can be conceived as knowledge about

the system sufficient to permit the user to mentally try out actions before choosing one to

execute.” Borgman (1986) summarizes the perspective of the research in the human

computer interaction community on mental models as “a general concept used to describe a

cognitive mechanism for representing and making inferences about a system or problem

which the user builds as he or she interacts with and learns about the system. The mental

model represents the structure and internal relationship of the system and aids the user in

understanding it, making inferences about it, and predicting the system’s behavior in future

instances.” Endsley (1995), in her development of the situational awareness paradigm,

states that a well developed mental model provides: (a) knowledge of relevant system

elements that direct attention and classify information in the perceptual process, (b) a

means of integrating elements to form an understanding of their meaning, and (c) a

mechanism for projecting future states of the system based on its current state.

Regardless of the specific author, however, most definitions contend that a mental model is

a form of subjective representation of external reality, and allows implicit use of the model

for the purpose of “thinking” about the system. This fortunately renders the model functional

and affords the user some explicit, though limited, ability to consciously run the model.

Equally important, however, is the notion that a user’s mental model is seldom a perfect

analogy to the real system, and is “surprisingly meager, imprecisely specified, and full of

inconsistencies, gaps, and idiosyncratic quirks,” and quite often possesses blatant

superstitions (Norman, 1983).

The purpose of this discussion, however, is not to review the relative merits and theories of

mental models, but rather to discuss the way in which the general conceptualization of

mental models is useful in understanding the “out of the loop” performance problem in

highly automated aircraft. Aviation presents itself as a unique domain for the study of

mental models primarily for two reasons. First, nearly all of aviation’s operators,

especially those in the commercial domain, can generally be considered domain experts.

Further, not only are its participants highly trained and versed in aviation related concepts,

but all must perform a nearly identical task. This is not to say that all pilots have identical

mental models, or that their models are a perfectly balanced representation of the real

system. However, as a population of experts they most certainly have very similar models

of the system, and their models are by necessity fairly accurate representations.

The second factor that makes aviation unique is the high level of complexity which must be

part of the flight task mental model. Not only must the pilot’s model include the traditional

manual controlling model in order to fly the aircraft, but the pilot must also have the

aircraft systems, airspace system, air traffic control system, communication, navigation, and

most importantly, the current dynamic state of the aircraft in relation to all the other

systems as part of that model. This notion is not unlike the perspective held by Williams,

Hollan, and Stevens (1983) that mental models are composed of autonomous objects with

an associated topology; an autonomous object being a mental object with an explicit

representation of state, set of rules governing parameters, and an explicit representation of

its topological connections to other objects. In addition, I propose that there must be two

levels of the same model: a static, schema like model of the system, and a real-time

dynamic execution of the model.

The static model is much like any operator’s model of a particular device, allowing the pilot

to clearly describe the operations of all the systems and the relationships between those

systems. When flying, however, the static model is the basis for the activation of the

dynamic execution. The dynamic execution is, in essence, the activation of the static model

with variable data entered into hypothetical “slots.” The activation of this model, however,

is not uniform, but rather a system with varying levels of activation in which components of

the model that are required for efficient task completion are most activated, with those

components unnecessary to the task remaining relatively inactive. As the operator

accomplishes the task, those areas of the static model which have become activated remain

that way for some time even when the task no longer supports the activation of the model.

The areas of activation provide the pilot with quick and easy access to those areas, and

benefit the pilot through more efficient cognitive and perceptual processing of features

related to the activated areas.

For example, a pilot, while on the ground, can explain the relationship between pitch,

power, altitude and airspeed. While flying, and especially while initiating a descent, the

pilot must use the information in this model, in combination with elements of the present

dynamic environment (i.e., current airspeed, throttle setting, pitch and altitude) in order to

execute the descent properly. I contend that during the execution of a task, in this case the

execution of a descent, the relevant portions of the static mental model and all their

associated elements become activated. Not only does activation allow for proper execution

of the task but, according to Endsley’s (1995) description of a well developed mental model,

“the model will provide (a) for the dynamic direction of attention to critical cues, (b)

expectations regarding future states of the environment (including what to expect as well as

what not to expect) based on the projection mechanisms of the model, and (c) a direct,

single-step link between recognized situation classification and typical actions.”

The unique feature of the commercial pilot, however, is that he has a well developed mental

model of the flying task, yet there are frequent examples (this paper and Billings, 1991) of

pilots failing to perceive and integrate information as would be expected given the quality

of their mental model (and its supposed level of activation). Most importantly, Endsley

(1995) points out that a mental model should provide for the dynamic direction of attention

to critical cues. Yet it often seems that pilots fail to attend to critical and sometimes life

threatening cues which should be perfectly salient. This dynamic execution theory of

mental models predicts that any weakening of activation would hinder the operator’s ability

to perceive critical elements in the environment, and would thus lead to conditions in which

critical cues are not perceived and integrated into useful information.

The proposed theory is somewhat similar in vein and at least tangentially related to

Neisser’s perceptual cycle (Neisser, 1976) as put forth by Adams, Tanney, and Pew (1995)

in their conceptualization of situational awareness as an active cognitive process. Neisser

views perceptual acuity and efficiency as a function of cognitive structures available at the

time of perception. Neisser states, “Because we can see only what we know how to look for,

it is these schemata (together with the information actually available) that determine what

will be perceived. At each moment the perceiver is constructing anticipations of certain kinds of

information, that enable him to accept it as it becomes available” (p. 20). The cyclic nature

of Neisser’s theory implies that each perceptual event results in a modification of the

schema which then “directs further exploration and becomes ready for more information.”

Neisser’s theory suggests that effective perceptual activity is contingent upon the quality

and nature of the previous perceptual cycle. If the activity undertaken by an individual is

different from the activity suggested by the operator’s primary goal, then the perceptual

cycle which ensues may be ineffective for guiding perceptual activity. According to the

dynamic execution theory, a mental model which remains fairly static (e.g., when the

operator inactively monitors for long durations) will likely lead to a perceptual system

unprepared for the consumption of critical information, or perhaps prepared for the wrong

information. Neisser’s perceptual cycle (1976) also suggests that as the operator’s task

shifts there should be a transitory period during which an inadequate perceptual cycle must

be replaced in favor of a more appropriate, and thus effective, perceptual cycle.

The next section will review previous research from controlled empirical studies which

examined human monitoring behavior in manual and automated systems, some of which

allude directly to the notion of mental models or similar concepts. In fact, Endsley and

Kiris (1994) suggest that some forms of manual control may lead to “maintenance” of an

operator’s mental model. While certainly related, such suggestions are problematic given

the quality of the mental model possessed by experienced pilots. Further, although there is

an implication in some studies that manual control improves cue sensitivity (Johannsen,

1976; Kessel & Wickens, 1982; Wickens & Kessel, 1979-1981; G. Young, 1995; Young,

1969), such claims are generally not explicit.

I hope not only to shed some new light on the results of past research using the dynamic execution theory, but also to present new experimental results which further support the hypothesis that effective sensitivity to key elements in a dynamic process environment, and correct integration of and response to those elements, are contingent upon mental model activation stimulated by in-the-loop control.

Relevant Research

Given the proliferation of automation in modern cockpits, and the anecdotal and theoretical

support for the view that automation in cockpits should be approached cautiously (Billings,

1991), there is surprisingly little controlled, empirical research dealing with this issue.

Most of the research comparing monitors and controllers in automated, dynamic tasks has

employed tracking or flight control tasks with simulated flight dynamic shifts implicating

control system failure (Johannsen et al., 1976; Kessel & Wickens, 1982; Wickens & Kessel,

1979-1981; G. Young, 1995; Young, 1969) or actual flight tasks with a failure of the

automated system (Ephrath & Curry, 1977). Others have used cognitively oriented decision-making tasks (Endsley, 1995; Parasuraman, Molloy, & Singh, 1993; Thackray &

Touchstone, 1989). Findings from these flight- and tracking-task experiments have

generally supported a failure detection performance advantage for system controllers

(Endsley, 1995; Parasuraman et al., 1993; Johannsen, Pfendler, & Stein, 1976; Kessel &

Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969) although some

have found either no difference (Thackray & Touchstone, 1989) or a failure detection

advantage for monitors (Ephrath & Curry, 1977). These problematic research findings have been attributed to a task which was unnecessarily biased in favor of the system monitors (Young, 1995), to an experimental paradigm in which workload was too low (Parasuraman, Molloy, & Singh, 1993), or to experimental trials that were too short in duration (Thackray & Touchstone, 1989) to show any effects of impoverished failure detection performance by

monitors.

The methodological approach used in the present research is based on studies which found

superior failure detection performance for manual controllers on a tracking task (Kessel &

Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969). More

importantly, those subjects who controlled manually were also better at detecting failures

when they then transferred to the monitoring task. The transfer effects found in this

research offer the strongest evidence that differences between monitors and past-controllers

may be related to differences in subjects’ mental models of dynamic systems. The next

section will therefore focus primarily on related experiments in which monitors and

controllers were compared in a transfer condition. These findings are both theoretically

and operationally more significant and are the basis for the current research.

Paradigm History

Young’s (1969) single-axis tracking task, which yielded superior performance for controllers, was improved and expanded upon by Wickens and Kessel (1979), who designed a similar experiment using a two-dimensional pursuit tracking task to increase task complexity.

They also addressed concerns that Young’s (1969) auto-pilot methods may have been

biased against monitors by implementing a non-adaptive auto-pilot, making visual detection

of failures easier for monitors. Their results demonstrated that as subjects switched from monitoring to controlling, latency of failure detection decreased considerably, while accuracy increased slightly. Wickens and Kessel (1979) determined that the superior

performance of the controllers was due to the additional channel of proprioceptive

information available to controllers as they adapted to the failed condition. Interestingly,

analysis of latency distributions demonstrated that controllers only maintained an advantage

in the first few seconds after the onset of a failure. Because of the short, transient nature of a proprioceptive standard, detection had to occur within the first few seconds; otherwise, subjects resorted to the visual channel for failure information. This finding further strengthened

their argument that the controller advantage was a result of proprioceptive feedback.

In addition to the proprioceptive channel available to controllers, the authors hypothesized

that superior performance may have been due, in part, to a more consistent conceptual

model of the dynamic system. Less variability in a subject's conceptual representation of a

system should enhance the subject's ability to detect deviations from a normal state. This

was based on the view that a conceptual model of greater consistency developed as a result

of the controller's ability to differentiate between one’s own inputs and those acting upon

the system externally (e.g., turbulence). In addition to having an internal model of greater

consistency, it was believed that controllers in a system could test hypotheses about the

general state of the dynamic system through subtle system inputs, reinforcing and testing

the features of that model (Kessel & Wickens, 1982).

In order to determine the role of workload in failure detection performance, Wickens and

Kessel (1979) employed a secondary task in their experiment. As the side task was added

and its difficulty increased, no marked decrease occurred in detection performance of either

controllers or monitors, demonstrating a lack of interaction between participatory mode and

workload. Instead, higher levels of workload shifted the speed-accuracy bias toward speed

rather than accuracy.

Wickens and Kessel’s (1979) finding raised the question of why the increased workload of

the manual tracking task did not have a negative impact on failure detection performance

like that found by Ephrath and Curry (1977). Wickens and Kessel (1980) pointed out that

the manual tracking task and the failure detection task may not be competing for the same

resources, as had been previously believed. Moreover, the resources allocated to the tracking task were the very ones that allowed subjects to utilize proprioceptive feedback in the detection process. This suggested that these operations work in

cooperation, rather than in competition, with each other. Using the same experimental

paradigm, but employing multiple secondary tasks focusing on structure-specific resource

allocation, Wickens and Kessel (1980) concluded that controlling and monitoring actually

rely on different processing resources to detect failures. Failure detection while monitoring

relies on perceptual/central processing resources, because monitoring is primarily a visual

task. While controlling, however, subjects rely on a response-related reservoir separate from central processing resources because of the proprioceptive nature of the task.

Although evidence and theory suggested that a subject’s conceptual representation of a

system may positively affect failure detection performance of controllers, previously

mentioned studies (Wickens & Kessel, 1979; Young, 1969) employed repeated measures

designs that had subjects perform both monitoring and controlling of the failure detection

tasks. Therefore, even if subjects developed a performance enhancing conceptual

representation while controlling, this advantage would have been available to the subjects in

either participatory mode.

Wickens and Kessel (1979) hypothesized that concurrent development of both a controlling

and a monitoring conceptual model negatively affected the performance of controllers. This

was based on evidence suggesting that visual information caused a reduced sensitivity to

proprioceptive information, especially when the two sources contradicted each other

(Posner, Nissen, & Klein, 1979). Therefore, because of a strictly visual-cue based model

developed while monitoring, subjects may have had the tendency to rely on faulty visual

cues while controlling. This bias toward visual cues when the two information sources were

in conflict therefore negatively affected the performance of controllers. Of course,

switching to a between-subjects design eliminated this problem.

Kessel and Wickens (1982) isolated the impact of subjects' conceptual representations on

their failure detection performance by switching to a between-subjects, transfer-of-training

design. In this study, three groups of subjects were used: the first group transferred from

controlling to monitoring, the second group transferred from monitoring to controlling, and

the third group monitored in both sessions. Consistent with expectations, monitors took

longer to respond to system failures and made more errors than controllers. Further, the

magnitude of improvement of controllers versus monitors with this design was

approximately five times that found in the previous repeated measures designs, thus

confirming the view that the monitor/controller conceptual model bias had been

undermining controller performance and perhaps aiding subjects while monitoring.

The most powerful demonstration of the importance of conceptual representations,

however, was the significantly better monitoring performance of subjects who had controlled during the first session (Kessel & Wickens, 1982). This result indicated that

controlling not only led to the development of a conceptual model that aided in detecting

failures, but that the model was powerful enough to affect performance on a task which no

longer supported the features of that particular conceptual model. From the standpoint of

dynamic process control, this finding suggests that many of the benefits of automation can

be utilized while allowing the operator, through proper training, to maintain a conceptual

model optimal for detection of subtle changes in system performance (Young, 1995).

Kessel and Wickens’ (1982) transfer-of-training design was replicated by Young (1995), who improved on the design by implementing a yoking procedure that ensured identical visual stimuli for both controllers and monitors, thus eliminating auto-pilot-induced biases.

Further, Young (1995) addressed concerns that Kessel and Wickens’ (1982) transfer effects

may have been attributable to simple vigilance factors and not conceptual model differences

by creating a condition with a high rate of failures and a very short trial length (80 failures

in just over six minutes). If the earlier studies’ results represented merely vigilance-related effects, then a very short experiment with a high rate of failures would not show the same performance advantage for controllers.

Young (1995) successfully replicated Kessel and Wickens’ (1982) results, showing that

when active controllers are transferred to the monitoring task they are better at detecting

failures than subjects who only monitored. This was additional evidence that features of the

controlling task transfer to the monitoring condition, and Young’s (1995) yoking

methodology ensured that both controllers and monitors, when compared directly, received

identical visual stimuli. Young (1995) also found a nearly identical pattern of results when

the experiment was reduced in length from 45 minutes to just over six minutes. This

finding further supported the hypothesis that the improved failure detection performance

was due to an improved conceptual model guiding focus to relevant visual cues.

Present Research

Taken together, the results of Kessel and Wickens (1982) and Young (1995) strongly

suggest that individuals who control a simple dynamic system have an advantage in

detecting failures of that system when monitoring compared to individuals who only

monitor. Further, this research suggests that controllers develop a conceptual model of the

system which makes them more sensitive to subtle cues implicating system failure.

Although these findings are significant, they are limited in scope given the largely psycho-motor nature of the tracking task employed. Although controllers may in fact have a more effective “conceptual model” of the system, this model bears little resemblance to the multiple, schema-based mental model possessed by operators of complex dynamic systems.

Although the pilot of an aircraft, for example, may have a motor schema for manual control

of the aircraft, this is but one component of a mental model of far greater complexity. An

operator of a two-dimensional tracking task has essentially one display to guide his control, yet the aircraft pilot has multiple displays to track, not to mention six degrees of freedom

rather than two, and out-of-cockpit, tactile, and aural information to guide his control.

The objective of the present research is two-fold. In Experiment 1, I will attempt to replicate the findings of Kessel and Wickens (1982) using a more complex, non-psycho-

motor aviation-like dynamic task. This experiment not only seeks to replicate the original

finding that controllers show better monitoring performance, but also to validate this

paradigm as an improved experimental platform for exploring the idea that a better

“conceptual” model is responsible for improved performance.

The primary objective in the design of the experimental paradigm was to create a task

which supported “inferential monitoring” (Wiener, 1984; Parasuraman, 1986). In such

tasks, the monitor collects data from the display, each sample being regarded as a

“sequential sample from a population of known parameters” (Wiener, 1984). As the

monitor of this system, the operator entertains a “rolling null hypothesis” that system

parameters have not changed, but responds when some change in the parameters has been

detected.
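
To make the notion of inferential monitoring concrete, the sketch below casts the “rolling null hypothesis” as a simple sequential change detector. It is only an illustration of the logic, not part of the experimental software; the sample stream, the expected mean and standard deviation, and the threshold are all hypothetical values chosen for the example.

```python
# Minimal sketch of inferential monitoring as sequential change detection.
# Hypothetical values throughout; this illustrates the "rolling null
# hypothesis" idea, not the dissertation's task software.

def monitor_stream(samples, expected_mean, expected_sd, threshold=4.0):
    """Accumulate evidence that the observed samples have drifted away from
    the expected (non-failed) distribution; return the sample index at which
    the accumulated evidence exceeds the decision threshold."""
    evidence = 0.0
    for i, x in enumerate(samples):
        # Standardized deviation of this sample from expected behavior.
        z = (x - expected_mean) / expected_sd
        # CUSUM-style accumulation: evidence grows only while samples keep
        # departing from expectation, and drains back toward zero otherwise.
        evidence = max(0.0, evidence + abs(z) - 0.5)
        if evidence > threshold:
            return i  # reject the rolling null hypothesis -> report failure
    return None  # no failure inferred


if __name__ == "__main__":
    # Normal behavior for 20 samples, then a subtle mean shift (the "failure").
    normal = [10.0, 10.2, 9.9, 10.1] * 5
    failed = [10.6, 10.8, 10.7, 10.9, 11.0, 10.8]
    print(monitor_stream(normal + failed, expected_mean=10.0, expected_sd=0.3))
```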

Although the particular task is generated from aviation type components, the combination

of these tasks is synthetic, and simple enough so that an individual can acquire the basic

principle and operational requirements in a half-hour of training. The task is, however,

highly analogous to many forms of dynamic process control where a failure of some sort is

not reflected in a single value, but rather in an apparent shift in the population mean

(Wiener, 1984) and thus inferential in nature. The task was designed so that failure

detection requires a synthesis of several features of the task, making detection from a single

signal nearly impossible.

Based on the view that aircraft pilots have a reasonably complex mental model of the flying

task and that numerous subtleties are built into this model (e.g., the sensory stimuli one has

while initiating a descent), every effort was made to include operational subtleties as part of

the system. These subtleties would, at least in theory, become part of the operator’s mental

model. Further, a mastery of these subtleties would enhance one’s ability to infer a failure

since system subtleties initially have the effect of masking actual system behavior, but lose

and eventually reverse this effect as proficiency with the system increases.

Finding a controller advantage in a paradigm that requires inferential monitoring for effective failure detection would provide evidence that a more effective mental model can assist in the detection

process. However, such a finding would not necessarily exclude a general vigilance

explanation. Therefore, a second failure detection task was added which would represent

the more traditional signal/no signal vigilance task. This failure type was represented by a

bold red indicator surrounding a fuel pump and is analogous to a sub-system indicator light

illuminating in a cockpit. Because indications of the failure are explicit and unpredictable

from system behavior, this failure is completely and intentionally non-inferential.

In addition to the two failure types, this experiment employed two different auto-pilot types.

The first type of auto-pilot was the “yoked” type as used originally by Young (1969) and later by Young (1995), in which monitors’ visual stimuli consisted of recorded representations of controllers’ performances. This method is experimentally superior, as it ensures that the visual stimuli received by both controllers and monitors are identical in all conditions. However, it has the disadvantage of providing visual stimuli which, in terms of auto-pilot-like behavior, are unrealistic. Thus, effects found could be criticized in terms of validity,

since monitors of real dynamic process control systems typically see the system operated in

an optimally efficient manner. For this reason, a third condition was added that used an

“optimized” auto-pilot. This optimized auto-pilot system regulated pump activity in a

highly efficient manner so that fuel levels were always within the “safe” areas, and the

throttle setting perfectly mirrored recommended throttle settings.
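
The difference between the two auto-pilot conditions can be summarized in a short sketch. This is a hypothetical illustration only: it assumes a simple record-and-replay mechanism for the yoked condition and a rule-based controller for the optimized condition, and all class, attribute, and parameter names are invented for the example.

```python
# Hypothetical illustration of the two auto-pilot conditions; the names and
# update rules are assumptions for this sketch, not the actual software.

class YokedAutopilot:
    """Replays a recorded controller session so that a monitor sees exactly
    the visual stimuli that a previously yoked controller produced."""
    def __init__(self, recorded_frames):
        self.frames = recorded_frames  # e.g., [(throttle, pump_states), ...]
        self.t = 0

    def step(self):
        frame = self.frames[min(self.t, len(self.frames) - 1)]
        self.t += 1
        return frame


class OptimizedAutopilot:
    """Generates idealized behavior: throttle mirrors the recommended setting
    and pumps are toggled whenever a tank drifts below its safe band."""
    def step(self, recommended_throttle, tank_levels, safe_bands):
        pump_states = {
            tank: level < safe_bands[tank][0]   # pump into tanks running low
            for tank, level in tank_levels.items()
        }
        return recommended_throttle, pump_states
```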

Experimental Hypotheses, Experiment 1

This experiment used a completely new experimental paradigm to test well-replicated

findings that past controllers make efficient monitors (Johannsen, et al., 1976; Kessel &

Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995; Young, 1969). This research

is thus somewhat exploratory in nature. However, it is expected that the findings will again

show that controllers, when compared directly with monitors, are superior at detecting

failures. It is possible, however, that the higher workload resulting from the controlling

task will negate some of the typical benefits of controlling (e.g., hypothesis testing,

conceptual model improvements).

More importantly, however, it is expected that controllers will be more efficient monitors

when compared to individuals who monitor in both conditions. Further, these differences

should appear only in the inferential monitoring task, and not in the simple explicit

detection task. This expectation is a result of the hypothesis that the improved controller

performance observed in the past is due to an activated mental model guiding subjects to

subtle system cues. It is also hypothesized that any differences between controllers and

monitors will be seen in both yoked and optimized auto-pilot conditions, since both auto-

pilot types have in the past shown differences between monitors and controllers.

EXPERIMENT 1

Method

Subjects

Thirty-eight right-handed male university students participated in the experiment and were paid a base rate for their participation. Additionally, subjects were given the opportunity to earn a five-dollar bonus for good performance. All subjects had 20/20 or corrected-to-20/20 vision.

Apparatus

A 50-MHz Intel 486 PC with a 17-inch color CRT display was used. A spring-centered, dual-axis hand control (CH Products FlightStick) with a finger-operated trigger was connected to the PC via a 12-bit A/D converter. The subjects sat in a cushioned, semi-reclining chair, with a rest supporting their arm and the “joystick.” The seating position yielded an eye-to-display distance of approximately 100 cm. The room containing the

apparatus was darkened, with primary light being provided by a red bulb for the purpose of

simulating a night cockpit environment.

Task

A discrete, single-dimension tracking task was used in combination with a fuel management task in the aviation-based simulation (see Appendix A). The display

contained a “pictorial” representation of an aircraft fuel system with tanks in each wing,

two in the front, and two in the rear of the aircraft. Fuel tanks were interconnected with a

series of symbolic fuel lines showing fuel flow direction, and boxes on the fuel lines

represented pumps which were either on or off. The fuel management portion of the task was

similar to the Multi-Attribute Task Battery (Comstock & Arnegard, 1992) fuel management

task used by Parasuraman et al. (1993, 1996). The throttle level and recommended throttle

setting which made up the discrete tracking task were located in the right portion of the

display and the aircraft’s speed was displayed digitally in the nose of the aircraft.

The single-dimension, discrete tracking task required subjects to use the joystick in order to

match the aircraft’s current throttle level with the “recommended throttle setting” level.

The current throttle setting was indicated by a yellow bar, while the “recommended throttle

setting” was indicated by an adjacent blue bar. Throttle position directly controlled the

displayed speed of the aircraft, which was shown digitally in the nose of the aircraft; more importantly, throttle position controlled the amount of fuel consumed by the aircraft.

The relationship between throttle position and speed was linear, but the relationship

between speed and fuel consumption was non-linear. Therefore, higher throttle positions

resulted in greater fuel consumption at an increased proportion of fuel to speed. This

speed/fuel consumption relationship meant that doubling the speed, for example, resulted in

greater than double the fuel consumption.
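
As an illustration of this non-linearity, the sketch below models fuel flow as growing faster than linearly with speed. The exponent and scaling constant are hypothetical; the dissertation does not report the actual function used by the simulation.

```python
# Hypothetical sketch of a non-linear speed/fuel-consumption relationship.
# The exponent and scaling constant are assumptions for illustration only.

def speed_from_throttle(throttle_pct, max_speed=500.0):
    """Throttle-to-speed mapping is linear, as described for the task."""
    return max_speed * (throttle_pct / 100.0)

def fuel_flow(speed, k=0.001, exponent=1.6):
    """Fuel consumed per unit time grows faster than linearly with speed,
    so doubling the speed more than doubles the fuel flow."""
    return k * speed ** exponent

if __name__ == "__main__":
    slow, fast = speed_from_throttle(40), speed_from_throttle(80)
    print(fuel_flow(fast) / fuel_flow(slow))  # > 2.0 (about 3.0 here)
```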

The fuel management task involved the on/off manipulation of six fuel pumps which

controlled fuel flow between fuel tanks. Subjects manipulated the fuel pumps by toggling

keys on the keyboard that were both mapped to the general layout of the fuel pumps and labeled with a specific fuel pump number. The fuel management task required

subjects to manipulate the fuel transfer pumps in order to keep fuel levels in the four main

tanks at “safe” levels, indicated by yellow bars on the fuel tanks. Subjects were told that

their task was to pump fuel out of the wing tanks and into the front and rear tanks so that

aircraft balance would remain in equilibrium.

The task was made more difficult by three subtle features of the system. First, as mentioned

earlier, although fuel depletion from the rear tanks was controlled by the speed of the

aircraft, the relation between aircraft speed and fuel consumption was non-linear, so that

subjects had to pay close attention to the throttle level in order to predict fuel consumption.

Second, the fuel tanks, although pictorially similar in size, had different fuel capacities, so that a single pump transfer changed the displayed fuel levels of the two affected tanks by different amounts. Third, the fuel pumps had different flow rates, so that a

pump’s flow rate was contingent upon the location of that pump in the fuel system.
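
A per-tick update embodying these three subtleties might look like the following sketch. All capacities, pump rates, routes, and the consumption function are invented for illustration; they are not the values used in the actual simulation.

```python
# Hypothetical per-tick update of the fuel system, illustrating the three
# subtleties: non-linear consumption, unequal tank capacities (so equal
# transferred volumes move the two displayed levels by different amounts),
# and pump-specific flow rates. All numbers are invented for illustration.

TANK_CAPACITY = {"left_wing": 3000, "right_wing": 3000,
                 "front_a": 1500, "front_b": 2000,
                 "rear_a": 1500, "rear_b": 2000}          # arbitrary units
PUMP_RATE = {"P1": 20, "P2": 20, "P3": 10,
             "P4": 10, "P5": 15, "P6": 15}                # units per tick
PUMP_ROUTE = {"P1": ("left_wing", "front_a"), "P2": ("right_wing", "front_b"),
              "P3": ("left_wing", "rear_a"),  "P4": ("right_wing", "rear_b"),
              "P5": ("front_a", "rear_a"),    "P6": ("front_b", "rear_b")}

def update(fuel, pumps_on, speed, k=0.001, exponent=1.6):
    """Advance fuel amounts one tick: burn fuel from the rear tanks as a
    non-linear function of speed, then apply each active pump's transfer."""
    burn = k * speed ** exponent
    for tank in ("rear_a", "rear_b"):
        fuel[tank] = max(0.0, fuel[tank] - burn / 2)
    for pump in pumps_on:
        src, dst = PUMP_ROUTE[pump]
        moved = min(PUMP_RATE[pump], fuel[src])
        fuel[src] -= moved
        fuel[dst] = min(TANK_CAPACITY[dst], fuel[dst] + moved)
    # Displayed bar height is fuel relative to capacity, so equal transferred
    # volumes move differently sized tanks' bars by different amounts.
    return {tank: fuel[tank] / TANK_CAPACITY[tank] for tank in fuel}
```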

Two types of failures occurred in the system, each representing a different type of fuel

system failure (see Appendix A). The first type of failure, the signaled pump failure, was

indicated by the symbolic pump border changing from thin white to a highly salient thick

red. Subjects had five seconds to detect this failure. If the failure was detected, or time

expired, the red border returned to white. Subjects were told that a pump failure indicated

a problem with a pump, but that pressing the trigger returned the pump to normal

functioning. A pump failure was totally unrelated to the pump or fuel tank behavior, and

was thus only detectable by the change in the fuel pump border.

The second failure type was the inferential failure, called a “pressurization” failure, and

was indicated by abnormal behavior of fuel levels within the four fuel tanks. A

pressurization failure occurred when the fuel level in one of the four main tanks increased

or decreased in a manner inconsistent with what would be expected given (a) fuel pump activity and (b) the aircraft’s rate of fuel consumption. This task was made more difficult by the

subtle system features mentioned previously. Subjects had 16 seconds to detect a

pressurization failure. If the failure was detected, the abnormal fuel flow behavior stopped.

If the failure went undetected during the 16-second failure duration window, the abnormal

fuel flow ceased and the tank level remained at its new level.
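
To clarify what made this failure inferential, the sketch below shows one hypothetical way such a failure could be injected: an unexplained drift is superimposed on a single tank, so that the only evidence of failure is a mismatch between the observed level change and the change predicted from pump activity and consumption. The drift rate and all names are assumptions for illustration.

```python
# Hypothetical illustration of the inferential ("pressurization") failure:
# the observed level change of a tank no longer matches the change predicted
# from pump activity and fuel consumption. Names and numbers are invented.

def predicted_change(tank, pumps_on, pump_rate, pump_route, burn_for_tank):
    """Level change expected from legitimate causes only."""
    change = -burn_for_tank
    for pump in pumps_on:
        src, dst = pump_route[pump]
        if dst == tank:
            change += pump_rate[pump]
        if src == tank:
            change -= pump_rate[pump]
    return change

def inject_pressurization_failure(predicted, drift_per_tick=8.0):
    """During a failure, an unexplained drift is superimposed on the
    predicted change; the only cue is the mismatch the monitor must infer."""
    return predicted + drift_per_tick
```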

Experimental Design

Three groups participated in the transfer-of-training, between-subjects design. The first group controlled on the first day of the experiment, the second group monitored in the “optimized” auto-pilot condition, and the third group monitored in the “yoked” auto-pilot condition. On the second day (the transfer day), all three groups monitored in both the auto-pilot and yoked conditions in four 14-minute trials, with two counterbalanced trials of each monitoring condition.

The experimental portion of each day consisted of four 14-minute trials with a two-minute break between each trial. Each 14-minute trial had seven pump failures and seven pressurization failures. Failure type and failure sequence were randomized, and the time between failures was between 20 seconds and three minutes (see Figure 1).

Participatory Mode, Experiment 1

Group                     Session 1 (Day 1)     Session 2 (Day 2, monitoring)
Controllers               Control               Auto-pilot and Yoked
“Auto-pilot” monitors     Auto-pilot            Auto-pilot and Yoked
“Yoked” monitors          Yoked                 Auto-pilot and Yoked

Figure 1. Experimental design, Experiment 1. Session 2 counterbalanced by condition.
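
The randomized failure schedule described above can be illustrated with a short sketch. The sampling scheme (uniform inter-failure intervals and a shuffled sequence of failure types) is an assumption made for the example; only the counts (seven of each failure type per 14-minute trial) and the 20-second-to-three-minute spacing come from the design.

```python
# Illustrative generator for one trial's failure schedule. The uniform
# sampling of inter-failure intervals and the shuffling scheme are
# assumptions; only the failure counts and spacing bounds are from the text.
import random

def make_trial_schedule(n_pump=7, n_press=7, min_gap=20, max_gap=180,
                        trial_length=14 * 60, seed=None):
    rng = random.Random(seed)
    failure_types = ["pump"] * n_pump + ["pressurization"] * n_press
    rng.shuffle(failure_types)
    schedule, t = [], 0.0
    for ftype in failure_types:
        t += rng.uniform(min_gap, max_gap)   # 20 s to 3 min between failures
        if t > trial_length:
            break  # in practice gaps would be re-drawn to fit the trial
        schedule.append((round(t, 1), ftype))
    return schedule

if __name__ == "__main__":
    for onset, ftype in make_trial_schedule(seed=1):
        print(f"{onset:7.1f} s  {ftype}")
```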

Training

The training consisted of part- and whole-task practice for the first thirty minutes of the first day. Subjects received practice either controlling or monitoring each component task, then

received practice with the whole system, first with performance feedback, then without.

After the practice session subjects were instructed to ask the experimenter if they had any

questions about the task, and all questions were answered. Subjects were also given ten

additional minutes of monitoring training at the beginning of the second day. During this training, all subjects saw both auto-pilot types and were told that they would be exposed to both types during the experiment.

Considerable emphasis was put on how the system operated in terms of its “structure and

processes” (Kieras & Bovair, 1984). The mechanics of the system were explicitly explained (e.g., “Pump P1 controls fuel flow from the left wing tank to the front fuselage tank.”), the subtleties of system behavior were explained (e.g., “Pump P1 has twice the fuel-pumping capacity of pump P3.”), and the concept of the system was explicitly explained (e.g., “airplanes are sensitive to the location of weight, making it important that fuel be

equally distributed in the fuel tanks.”).

This was done so that subjects developed a complete mental model of the system, as

emphasized in research comparing operators with and without mental, or “device,” models

(Kieras & Bovair, 1984). Although the training received by controllers and monitors differed in the specific level of control exercised, every effort was made to ensure that all other

elements of the training (e.g., training time and level of explanation of the dynamic system)

were identical.

Results

Between- and within-subjects comparisons were made using signaled failure reaction time

(RT) and a combined RT and error rate measure for inferential failures. Analyses of

variance (ANOVA) were used to test for group differences and interactions for both

signaled and inferential failures. The combined performance measure for inferred RT was

used for the purpose of managing between-subjects variability common with complex

dynamic task performance (Parasuraman, 1986). Further, as discussed in the next section,

its use was based on precedent (Kessel & Wickens, 1982; Wickens & Kessel, 1979-1981)

and on theoretical grounds (Gai & Curry, 1976).

The use of an efficiency index is based on the assumption that, “subjects aggregate evidence

over time concerning the discrepancy between the sampled-system behavior and the

internal model of a non-failed system, until this evidence exceeded an internal decision

criterion. Detection efficiency is reflected in the rate of aggregation of internal evidence,

independent of the criterion setting” (Wickens & Kessel, 1980, p. 569). Therefore, because

efficient detection is reflected in both speed and accuracy, it should be captured by an index integrating

both measures.
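
The assumption quoted above can be illustrated with a toy evidence-accumulation model; everything in the sketch below (drift rate, noise, time step) is hypothetical and is meant only to show how detection efficiency, the rate of aggregation, is conceptually separable from the criterion setting.

    import random

    def detection_time(drift, criterion, noise_sd=1.0, dt=0.1, max_t=16.0, rng=random):
        """Toy accumulator: evidence about the discrepancy between sampled system
        behavior and the internal no-failure model builds at rate `drift` (the
        efficiency term) until it crosses `criterion`.  Returns the detection
        time in seconds, or None for a miss within the 16-second window."""
        evidence, t = 0.0, 0.0
        while t < max_t:
            evidence += drift * dt + rng.gauss(0.0, noise_sd) * dt
            t += dt
            if evidence >= criterion:
                return t
        return None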

Although indexes used in earlier research have been described as “somewhat arbitrary,”

(Wickens & Kessel, 1980, p569), every effort was made here to remove the arbitrary nature

of the weighting method, while still combining the measures and reducing overall RT

variability. Therefore, it was decided that a weighting scale would be used in which each RT

was classified as either fast (<8000 ms) or slow (≥8000 ms), since eight seconds was very

close to the grand mean for the experiment. “Fast” RTs were scored as 1, RTs which were

“slow” were scored as 2, while misses were scored as 3. This created an ascending

combined RT/error rate scale in which optimal performance generated a 1 (every failure

detected in less than 8 seconds), and the worst performance generated a 3 (every failure event was

missed). By using only three consecutive levels in the index, misses were appropriately

weighted as substantially worse than long “hits.” While the 16-second “hit” window is

somewhat arbitrary, it was considered acceptable because inferred failure detection in a real

task would be highly task and failure dependent. Further, detection performance in this

task is based on a continuum, so the actual window size is not particularly meaningful.

However, if this paradigm were based on a real operational task with real failures, this “hit”

window would take on considerable meaning. It was believed that this index successfully

reduced variability, yet was far less arbitrary than other weighting methods. The raw RT

and error rate data are provided in Appendix B.
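
Concretely, the scoring rule can be expressed as follows. The per-block averaging shown here is my reading of how the group means reported later (values between 1 and 3) were obtained, and the helper names are of course not from the original analysis code.

    def combined_score(rt_ms):
        """Score one inferred-failure event on the 1-3 index described above:
        1 = detected in under 8 s, 2 = detected in 8-16 s, 3 = missed."""
        if rt_ms is None:          # no response within the 16-second window
            return 3
        return 1 if rt_ms < 8000 else 2

    def combined_index(rts_ms):
        """Mean index over a set of inferred failures (lower is better)."""
        scores = [combined_score(rt) for rt in rts_ms]
        return sum(scores) / len(scores)

    # Example: combined_index([5200, None, 9400, 7100]) == (1 + 3 + 2 + 1) / 4 == 1.75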

Signaled Failures

Signaled failure RT findings were generally contrary to expectation (see Figure 2). The yoked

monitoring group was marginally faster at detecting signaled failures than was the controller

group (974 vs. 1172 ms) when compared directly [F(1,24) = 3.3, p < .1] in Session 1. The

optimized auto-pilot group was not significantly faster than the controllers (1120 vs. 1172

ms) nor significantly different from the yoked group (1120 vs. 974 ms).

In the transfer condition (Session 2), the yoked group was marginally faster than the

controllers (873 vs. 1103 ms), [F(1,24) = 2.01, p < .15]. Although this effect is weak, it is

reported because it is highly contrary to expectation. The auto-pilot group was also quicker

than the controllers (939 vs. 1031 ms), although this effect was not significant. The

difference between the yoked group and the optimized auto-pilot group (873 vs. 939 ms)

was also not significant. (See Figure 2.)
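
For readers who want to reproduce this style of analysis, the sketch below shows the form of the group comparison using SciPy's one-way ANOVA; the RT values and group sizes are simulated stand-ins, since the actual per-subject data appear in Appendix B.

    import numpy as np
    from scipy import stats

    # Simulated per-subject mean signaled-failure RTs (ms); real data are in Appendix B.
    rng = np.random.default_rng(0)
    yoked = rng.normal(974, 150, size=13)          # hypothetical group size
    controllers = rng.normal(1172, 150, size=13)

    # With two groups, a one-way ANOVA gives the F(1, N-2) test reported in the text.
    F, p = stats.f_oneway(yoked, controllers)
    print(f"F(1,{len(yoked) + len(controllers) - 2}) = {F:.2f}, p = {p:.3f}")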

[Figure 2 chart: mean Signaled Failure RT (ms) for the Controllers, Auto-pilot, and Yoked groups on Day 1, Day 2 Auto-pilot, and Day 2 Yoked.]

Figure 2. Experiment 1, Signaled Failure RT.

Inferential Failures

Figure 3 shows inferential failure detection results. Controllers were significantly better at

detecting inferential failures than the optimized auto-pilot group in Session 1 (2.535 vs.

2.676), [F(1,26) = 4.43, p < .05], but not significantly better than the Yoked group (2.535

vs. 2.65). The optimized Auto-pilot group was not significantly different from the Yoked

group (2.676 vs. 2.65).

In the transfer condition, in which all subjects performed both of the auto-pilot tasks,

controllers did not perform significantly better than either of the auto-pilot groups, although

all means were in the anticipated direction. The Controllers, when compared to the

optimized Auto-pilot group, were not significantly different (2.619 vs. 2.699), nor were the

Controllers different from the Yoked group when compared on the yoked task (2.577 vs.

2.593), [F(1,25) = .58]. Interestingly, the Yoked group was better than the Auto-pilot

group at the auto-pilot task in the transfer condition (2.66 vs. 2.7), although this difference

was not significant. (See Figure 3.)

[Figure 3 chart: mean Inferred Failure Detection index for the Controllers, Auto-pilot, and Yoked groups on Day 1, Day 2 Auto-pilot, and Day 2 Yoked.]

Figure 3. Experiment 1, Inferential Failure detection performance (combined index).

Discussion

The results of Experiment 1 only partially supported the experimental hypotheses, yielding

both expected and unexpected findings. The finding most consistent with previous

research was that subjects who controlled were significantly better at detecting inferential

failures than were the Auto-pilot monitors, and marginally better than Yoked monitors in

Session 1. This finding, although consistent with past research showing that controllers,

when compared directly with monitors, are better at detecting failures, was not entirely

predicted from the hypothesis given that the higher workload levels present when

controlling could have interfered with failure detection. In this particular task, the

proprioceptive feedback was limited to information pertaining to fuel consumption, and was thus

only indirectly related to failures. In the experiments using tracking tasks, however,

proprioceptive feedback was a direct indication of system failure and therefore a highly

salient cue. Proprioceptive feedback is therefore not considered a distinct advantage for

controllers in this task.

Subjects could “hypothesis test” in a failure condition as in past research using tracking

tasks, and this may have been a distinct advantage for Controllers. When Controllers

sensed illogical system behavior, they could test their hypothesis through pump or throttle

manipulation to see if their own inputs resulted in continued illogical system behavior. Post-

experiment interviews indicated that Controllers took advantage of “hypothesis testing”

when detecting failures. In addition, although the Controllers’ overall failure detection

performance was marginally better than the Yoked group, there was a slight and

nonsignificant speed-accuracy trade-off in Session 1 (see Appendix B), which may have

been a result of Controllers taking the extra time to hypothesis test prior to signaling a

failure, causing their RT to be slightly greater and their accuracy significantly better. It is

also possible that Controllers took advantage of a more activated mental model of the

system, and were thus more sensitive to illogical system behavior. However, when

comparing controllers to monitors, as in Session 1, it is difficult to determine the role of the

operator’s mental model activity given the other possible explanations for this advantage.

Reaction times for the signaled failures were generally consistent with the hypotheses.

Although the detection of signaled failures is hypothesized to be unrelated to mental model

activity in this experimental paradigm, it is likely that signaled failures are an effective

measure of workload. In fact, signaled failure RT was marginally faster for Yoked

monitors than Controllers, probably reflecting the lower workload levels for the Yoked

group. It is also possible that the greater latency for Controllers resulted from the

subjects’ need to scan a greater portion of the display in order to perform the sub-task of

matching the throttle with the recommended level. Thus, it is possible that the shorter RTs

of the Yoked group were because subjects spent more time focused in the center of the

display where both failure types occurred, rather than switching their focal point to the

throttle display area on the periphery of the display. Although the Auto-pilot monitors also

had lower workload levels than the Controllers (and perhaps even lower than the Yoked

monitors), their signaled failure RTs were not significantly slower than the Controllers’.

Although somewhat surprising, this is consistent with a pattern of marginal performance in

all conditions associated with the auto-pilot monitoring task. Although not central to this

research, the weak auto-pilot performance will be discussed further in the following

paragraphs.

Results from Session 2 did not generally support the hypothesis that system controllers are

better monitors than subjects who monitored in Session 1. Although group means were in

the predicted direction, there were no significant differences between the Controllers and

the Auto-pilot or Yoked monitors. Controllers were slightly better than the Auto-pilot

group when transferring to the yoked condition, but this difference was marginally

significant at best (p < .13). Further, this result is more likely a result of very poor

performance by the Auto-pilot group in the Yoked condition, rather than good performance

by the Controllers. The only significant finding for Session 2 was that the Yoked group

performed better than the Auto-pilot group when transferring to the yoked condition. This

finding was not surprising given that the Yoked group had previous experience with the

yoked condition and the Auto-pilot group did not. However, one would expect the opposite

results in the auto-pilot condition.

There are several possible explanations for the Controllers’ failure to perform significantly

better on inferred failure detection tasks than the two monitoring groups in Session 2.

While past tracking task research used two days for Session 1, not including training

(Kessel & Wickens, 1982; Wickens & Kessel, 1979-1981; Young, 1995), I believed that the

cognitive nature of this paradigm would allow it to be learned more quickly than the subtle

motor skills required in a difficult tracking task. This assumption was incorrect, however,

as post-experiment interviews and experimental data suggested that the task was actually

quite difficult to learn and perfect, and that the training time had not been sufficient for

subjects to master the task. In fact, some subjects suggested that they were still learning the

task well into Session 2. Further complicating this picture is the fact that the high

workload in the controlling condition (in both training and during Session 1) may have

made it more difficult for Controllers to learn the task as compared to the two monitoring

groups. Given that the group means were in the predicted directions, it is possible that the

Controllers did have an advantage in detecting inferential failures but, because of the

learning issues, this difference was not strong enough to generate a significant effect.

The experimental hypotheses stated that there would be no effect of prior experience for

signaled failure reaction times in Session 2. This prediction was based on the theory that

any advantage during monitoring afforded to past controllers was due to a more activated

mental model, and thus would not affect signaled failure detection performance. However,

RTs for signaled failures were marginally affected by condition. The Yoked monitor group

was faster (at a marginally significant level) than the Controller group while performing the

yoked monitoring task. The Auto-pilot monitors were also faster than the Controllers,

although not significantly. While the difference between Controllers and Yoked

monitors did not reach significance, it is unexpected and therefore quite interesting, and will be

explored further in the subsequent experiment. The most salient explanation for this

finding seems to be that Controllers scan the display more diligently than the Yoked

monitors, and therefore spend less time focused on the center of the display where the

signaled failures occur.

The theory that system controllers scan more effectively is further supported by the

observation that Controllers tended to perform better on the inferential failure detection task. This

improved failure detection performance could have been the result of Controllers

integrating subtle cues from the system more effectively and therefore being more sensitive

to system abnormalities. Importantly, this integrating process would likely use information

from the throttle display in forming a diagnosis. It therefore seems that if Controllers are

more sensitive to the system operations as an integrated unit, they spend more time focusing

on the throttle display, accessing the important throttle information, and less time focused on

the center of the display.

Implications of this finding are that subjects who have controlled, and are presumably

benefiting from controlling experience while monitoring, seem to be spending more time

scanning the display for useful information. Given that the throttle provides subtle clues

about system behavior, the Controllers should have derived a failure detection advantage if

they were allocating more time to studying its impact on the system. However, the

Controllers were not significantly better on the inferential failure detection task than the

other groups. This implies that the information provided by the throttle was not valuable

enough to improve inferential failure detection performance for those who observed it. It is

possible that while throttle information may have been advantageously used by Controllers,

other factors offset this advantage.

This explanation may provide some clues as to why, in accidents involving controlled flight

into terrain with the auto-pilot engaged, pilots failed to detect the danger even though there was

ample evidence of impending disaster. This finding also suggests that if one of the benefits

of controlling is better scanning performance, measurement of eye location and movement

may be diagnostic of the effects of the “out of the loop” performance problem suggested by

Smolensky (1993). It is also possible that Controllers, because of their extensive experience

with controlling the throttle, gained a greater understanding of the relationship between the

throttle and fuel system behavior, and were hence more inclined to observe throttle activity

even when monitoring the system. This explanation supports the contention that active

controllers scan more while monitoring because they have developed a different strategy,

either implicitly or explicitly, for detecting inferential failures.

As mentioned previously, two different monitoring groups were used to address concerns

that the experimentally superior “yoked” auto-pilot method may induce differences

compared to an “optimized” auto-pilot. However, even optimized automation is

completely task dependent. The optimized system for this task was based on the view that

aircraft automation is extremely consistent and rigid in the way it controls the various

systems. Therefore, the optimized system for this task consistently held fuel levels in the

“safe” zones, and operated the pumps in a rigid operational sequence to maintain correct

fuel levels. Additionally, the throttle setting was automatically maintained at the level of

the recommended throttle position.
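
A minimal sketch of that control logic is given below; the tank names, safe-zone bounds, and pump-to-tank mapping are hypothetical placeholders rather than the values used in the actual simulation, but the structure (a rigid pump sequence, safe-zone thresholds, and a throttle slaved to the recommended setting) follows the description above.

    # Hypothetical safe-zone bounds (fraction of tank capacity) and pump-to-tank mapping.
    SAFE_LOW, SAFE_HIGH = 0.40, 0.60
    PUMP_SEQUENCE = ["P1", "P2", "P3", "P4"]
    PUMP_FILLS_TANK = {"P1": "front", "P2": "rear", "P3": "left_wing", "P4": "right_wing"}

    def optimized_autopilot_step(tank_levels, recommended_throttle):
        """One update cycle: operate pumps in a fixed sequence to keep each tank
        inside its safe zone, and hold the throttle at the recommended setting."""
        pump_commands = {}
        for pump in PUMP_SEQUENCE:                      # rigid, always-identical order
            level = tank_levels[PUMP_FILLS_TANK[pump]]
            if level < SAFE_LOW:
                pump_commands[pump] = "on"              # pump fuel toward this tank
            elif level > SAFE_HIGH:
                pump_commands[pump] = "off"
        return pump_commands, recommended_throttle      # throttle tracks the recommendation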

Results from the experiment suggest that the hypothesized effects are not unique to the

“yoked” condition. In fact, in nearly all conditions the performance of the “optimized”

Auto-pilot group was worse than the Yoked group. This suggests that the effects found in

this paradigm are not due to the use of the yoked methodology. In fact, results obtained

using this method may underestimate effects found in applied settings due to the prevalence

of “optimized” automation in operational systems. Since this difference is not the focus of

this research, nor was it consistently significant, it will not be explored further. However, it

may be worth noting that the highly consistent behavior of the optimized system may have

had a numbing effect on subjects, thus pushing them even farther out of the control loop (or

so they perceived), reducing mental model activation even further. In addition, it is also

possible that the consistency of the automation made them believe the task was easier,

compared to the Yoked monitors, who had to pay close attention to the automated system in

order to know what it was doing.

This view is supported by the fact that in Session 2 the Auto-pilot group had marginally

poorer performance on inferred failure detection tasks in the yoked condition, compared to

Controllers, and significantly worse performance than the Yoked monitors. Thus, when

transferring to the more difficult yoked monitoring task, the optimized Auto-pilot group was

at a further disadvantage as a result of their experience in the highly predictable optimized

auto-pilot condition. Further, there are indications that the Yoked group performed better

at the optimized auto-pilot monitoring task than the Auto-pilot group, even though this

condition was novel to them. This suggests that some feature of the yoked monitoring task

made Yoked monitors more sensitive to system behavior, thus giving them an advantage

over the optimized group even in the optimized auto-pilot condition.

Experiment 1 Conclusions

The results of this experiment were informative and suggestive. Controllers, when

compared to both monitoring groups in Session 1, seemed to be better at detecting

inferential failures. This finding likely reflected “hypothesis testing” and perhaps improved

mental model activation by Controllers. Signaled failures, however, produced generally

opposite results, reflecting the higher workload of controlling and the necessity for

Controllers to observe the throttle for control purposes, which resulted in less time spent focused

in the center of the display.

Inferential failure detection means were in the predicted directions during Session 2,

although these differences were not significant. This finding suggests that a transfer effect

for Controllers may exist, but the experiment as conducted lacked power. The results of

signaled failure performance in Session 2 were surprising, perhaps reflecting the fact that

subjects with experience controlling scan the display more effectively, supporting previous

contentions that mental models play a role in guiding perceptual activity (Endsley, 1995).

However, this difference could also be due to Controllers developing a failure detection

strategy more dependent on the effect of throttle behavior. In either case, Controllers seem

to spend more time focused on the throttle while monitoring, and less time focused on the

center of the display.

Implications for Experiment 2

The primary goal of Experiment 1 was to replicate earlier findings that controllers of a

dynamic system are better at detecting system failures than subjects who only monitor when

both groups transfer to a monitoring condition. The intended contribution of Experiment 1

was to replicate these findings using a cognitively complex dynamic system management

paradigm. Although the experiment yielded interesting results, the primary objective was

not achieved. I believe this failure was a result of an insufficient manipulation of

conditions, not the result of non-existent effects.

Experiment 2 was designed both to correct the weaknesses of Experiment 1 and to further

explore the surprising findings from the signaled failure detection task. In addition, the

“optimized” auto-pilot condition was dropped from Experiment 2, since the predicted effect

seems to be present using both auto-pilot types and the yoking methodology is

experimentally superior.

A major difference between Experiment 1 and Experiment 2 is the inclusion of an

additional day for Session 1. The extra day was added to address the anecdotal and

experimental evidence suggesting that subjects were still learning the task well into the

second day (Session 2). I am interested in the transfer effects from a well learned task, and

it is therefore imperative that the task be well learned before subjects switch to the transfer

task. In addition, some subjects suggested in post-experiment interviews that they were

confused by the triggering system and the consequent lack of performance feedback, and

that this confusion further hampered their ability to quickly learn the task. Subjects

indicated that because of the subtle nature of the inferential failures, even though the failure

behavior ceased after detection, it wasn’t always clear if a failure had been successfully

signaled. This confusion was exacerbated by the fact that a “false alarm” deactivated the

trigger, so that when subjects positively identified a failure in the same trial, trigger

activation had no effect, leading them to believe that they had improperly diagnosed the

failure.

To address this confusion, a message system was added to the display to inform subjects of

both the state of the trigger (armed or not) and whether or not they had correctly identified

a failure. Not only did this procedural change augment the performance information that

subjects generally assumed on their own but, more importantly, it prevented any false

learning resulting from system state misinterpretation. Although this feedback could be

criticized on the grounds that better performers would receive more positive feedback, this

method afforded users the opportunity to learn from both correct and incorrect performance.

Although there was implicit feedback in Experiment 1, it favored individuals with better

performance to an even greater degree, since good failure detection performance likely

meant better system understanding. Therefore, improved system understanding not only

led to better performance, but also to more accurate interpretation of implicit system feedback.

The intentional system subtleties included in Experiment 1 were carried over into

Experiment 2, but were exaggerated somewhat to further occlude the inferential failures.

Failure onset was made more subtle, and pump flow-rate differences were exaggerated

slightly. Most importantly, the non-linear relationship between the throttle level and the

rate of fuel consumption was exaggerated, and the recommended throttle level changed

positions at a greater frequency, making throttle level monitoring (and controlling) more

demanding. This procedure was used because in Experiment 1 Controllers may have been

spending more time scanning the throttle display while monitoring. If knowledge of

throttle activity is now more important for inferential failure diagnoses, then scanning

behavior by Controllers should have a greater positive impact on their inferential

detection performance.

To test the hypothesis that Controllers may have poorer signaled failure detection

performance because they spend more time scanning the display for throttle information,

the throttle information was both moved farther to the edge of the display and made slightly

less salient. Both changes were made to increase the time required to effectively scan the

throttle information. This change should exaggerate the signaled failure detection

differences seen in Session 2 of Experiment 1, if scanning strategies were responsible for

this difference.

To further explore this issue, the throttle was removed from the display on half of the trials

in Session 2. If Controllers’ poorer signaled failure detection performance was due to

differences in scanning behavior, then their signaled failure detection performance should

improve, mirroring that of monitors when there is no throttle present. Likewise, if

Controllers are using the throttle information to facilitate inferred failure detection, then the

removal of this information should reduce their inferential failure detection advantage.

Experimental Hypotheses, Experiment 2

Experiment 2 was designed to correct the shortcomings of Experiment 1 and to further

explore its unexpected findings. I expect the direct comparison

between Controllers and Monitors in Session 1 to again show a small advantage for

Controllers in the inferential monitoring task, as a result of hypothesis testing and perhaps

improved mental model activation, but a disadvantage in the signaled failure detection task

due to higher workload and the need to spend more time focused on the throttle display.

Controllers should also show an advantage over Monitors in the inferential failure detection

task in Session 2, supporting the hypothesis that the heightened activation of the

controllers’ mental models makes them more sensitive to inferential failures when

transferring to the monitoring task.

However, this advantage in inferential failure detection may only be present in the “throttle

visible” condition (see Method). If the activated mental model guides perception (Endsley,

1995), and attention is thus directed to the throttle information on the display because it

provides relevant data for inferring abnormal operation, the absence of throttle information

should impair the failure detection advantage of Controllers. Further, if the poor signaled

failure detection performance in Session 2 by Controllers in Experiment 1 was a result of

their scanning behavior, then the “throttle not visible” condition in Experiment 2 will show

an improvement in signaled failure detection performance since effective scanning will no

longer include items in the periphery of the display.

EXPERIMENT 2

Method

The Methods section for Experiment 2 highlights only differences from Experiment 1.

Subjects

Thirty-eight volunteer, right-handed, male university students from Introductory

Psychology courses were used in the experiment. Students received “experimental credit”

and were paid a base rate for their participation in the experiment. Additionally, subjects

were given the opportunity to earn a five dollar bonus for good performance.

Task

The task for Experiment 2 was the same as that used for Experiment 1 except for the

following changes: A trigger and performance feedback message was added to the lower

right corner of the display to address the confusion about system state expressed by subjects

in Experiment 1. “Trigger Armed,” “False Alarm, trigger INOP until reset,” “Failure

Detected - system resetting,” and “Miss - system resetting” messages appeared in accordance

with the system state. In addition, the messages were color coded to heighten awareness of

changes in the system state.

Two changes were made to the throttle portion of the display. First, the throttle was moved

farther toward the upper right-hand corner of the display and the “recommended throttle

position” was made less salient by decreasing the width of the indicator bar. Both of these

changes were made to increase the time required to scan the throttle-setting portion of the

display. In a related change, the digital aircraft speed was moved from the forward-center

location of the aircraft to the upper left-hand corner of the display. This was done to further

increase the time needed to effectively scan all information components of the display. The

second major change was the removal of all throttle information on half of Session 2 trials.

This change eliminated the need for subjects to scan the periphery of the display, but also

removed information which may have helped them in the inferential failure detection

process.

In order to further occlude normal system operation and thus complicate the inferential

failure detection process, individual pump flow rate differences were exaggerated, the

linearity of the throttle level/fuel flow ratio was degraded, and inferential failures

themselves were made slightly harder to detect. The final change to the task for

Experiment 2 was that the time given to detect a pump failure was reduced from 5 seconds

to 3.5 seconds because the results of Experiment 1 suggested that the extra 1.5 seconds was

unnecessary.

Experimental Design

Two groups participated in the transfer-of-training, between-subjects design. The first

group controlled the system during the first and second days of the experiment (Session 1)

while the second group monitored a “yoked” auto-pilot during Session 1. On the third day

(Session 2), the transfer condition, both groups monitored the yoked condition and detected

failures. However, in two of the four trials, the throttle information was eliminated from

the display. (See Figure 4.)

Participatory Mode, Experiment 2

                     Session 1 (Days 1-2)    Session 2 (Day 3, monitoring)
Controllers          Control                 Throttle Visible / Throttle NotVisible
"Yoked" monitors     Monitor                 Throttle Visible / Throttle NotVisible

Figure 4. Experimental design, Experiment 2. Session 2 counterbalanced by condition.

Training

The training session was 30 minutes at the beginning of Day 1, and was identical to

Experiment 1 except that subjects received performance feedback throughout training.

Results

Between- and within-subject comparisons were made for both signaled failure RT and the

combined RT and error-rate measure for inferential failures used in Experiment 1. An

analysis of variance (ANOVA) was used to test for group differences and interactions.

Session 1 data were from Day 2 only unless otherwise specified, as Day 1 was treated as

learning. The raw RT and error rate data for Inferred failures are provided in Appendix C.

Signaled Failures, Session 1

Signaled RT findings generally supported the experimental hypotheses. As suggested by

the results of Experiment 1, subjects were still learning the task into the second day, as

demonstrated by the significant improvement in mean reaction time from Day 1 to Day 2,

[F(1,36) = 13.6, p < .01]. Although the Group by Day interaction was not significant,

Controllers’ improvement was larger from Day 1 to Day 2 in Session 1 (960 vs. 835 ms),

[F(1,17) = 14.45, p < .01], than the Monitors’ (867 vs. 794 ms), [F(1,19) = 3.09, p < .1]. A simple

comparison between Controllers and Monitors in Session 1 (Day 2) was in the predicted

direction but was not significant (835 vs. 794 ms), [F(1,37) = .27].

Signaled Failures, Session 2

Session 2 yielded surprising findings for signaled failures. There was a main effect

favoring Controllers over Monitors, [F(1,36) = 4.75, p < .05], and as shown in Figure 5, a

marginally significant Group by Condition (Visibility) interaction [F(1,36) = 2.39, p <

.15]. There were no significant group differences in the throttle Visible condition (667 vs.

728), [F(1,37) = 1.21], but there was a significant difference in the throttle NotVisible

condition (604 vs. 757), [F(1,37) = 6.62, p < .05]. As expected, Controllers improved

from the throttle Visible to the throttle NotVisible condition (667 vs. 604), [F(1,17) = 6.64,

p < .05], while the Monitor’s mean RT increased, but not significantly (728 vs. 757). (See

Figure 5.)
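
One way to see where the marginal Group by Condition interaction comes from is to compare the groups on each subject's Visible minus NotVisible difference score, which for a 2 x 2 mixed design is equivalent to the interaction F test. The sketch below uses simulated RTs with the group sizes implied by the reported degrees of freedom; the real values are in Appendix C.

    import numpy as np
    from scipy import stats

    # Simulated per-subject mean signaled-failure RTs (ms) in the two Session 2 conditions.
    rng = np.random.default_rng(1)
    ctrl_vis, ctrl_not = rng.normal(667, 80, 18), rng.normal(604, 80, 18)
    mon_vis, mon_not = rng.normal(728, 80, 20), rng.normal(757, 80, 20)

    # Group x Visibility interaction == between-group test on within-subject differences.
    ctrl_diff = ctrl_vis - ctrl_not
    mon_diff = mon_vis - mon_not
    F, p = stats.f_oneway(ctrl_diff, mon_diff)
    print(f"Group x Visibility: F(1,{len(ctrl_diff) + len(mon_diff) - 2}) = {F:.2f}, p = {p:.3f}")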

[Figure 5 chart: mean Signaled Failure RT (ms) for Controllers and Monitors on Session 1 Day 1, Session 1 Day 2, Session 2 Visible, and Session 2 NotVisible.]

Figure 5. Experiment 2, Signaled Failure RT.

Inferential Failures, Session 1

Inferential failure detection performance strongly supported the experimental hypotheses.

There was a significant effect for Day in Session 1 favoring Day 2 [F(1,36) = 15.2, p <

.01] with no significant Day by Group interaction, supporting the contention that both

groups were still learning the task after the first day. This is also supported by the false

alarm data which showed a significant reduction by day, [F(1,36) = 20.4, p < .01], and no

interaction.

Controllers had a lower mean (better performance) than did Monitors in Session 1 (Day 2),

but it was not significant [F(1,37) = .18]. As in Experiment 1, there was a slight non-

significant speed-accuracy trade-off in this condition (see Appendix C), favoring better

accuracy for Controllers, but slightly slower reaction time.

Inferential Failures, Session 2

Controllers in Session 2 had significantly better inferential failure detection performance

than monitors, but only in the throttle Visible condition (1.9 vs. 2.15), [F(1,37) = 4.19, p <

.05]. In the throttle NotVisible condition, the mean performance score for Controllers was better than

the Monitors’, but not significantly so (2.01 vs. 2.04), [F(1,37) = .1]. Although there was no group effect favoring

Controllers over Monitors by condition, there was a marginally significant interaction (see

Figure 6), [F(1,36) = 3.67, p < .1], resulting from Controllers having poorer performance

in the throttle NotVisible condition compared to the throttle Visible condition (1.9 vs.

2.01), [F(1,17) = 2.35, p < .15], while Monitors performed better in the throttle NotVisible

condition, although this difference was not significant (2.15 vs. 2.04), [F(1,19) = 1.55].

[Figure 6 chart: mean Inferred Failure index for Controllers and Monitors on Session 1 Day 1, Session 1 Day 2, Session 2 Visible, and Session 2 NotVisible.]

Figure 6. Experiment 2, Inferential Failure detection performance (combined index).

Discussion

The results of Experiment 2 strongly support the experimental hypotheses, with few

surprises. Improvements in signaled and inferential failure detection performance from

Day 1 to Day 2 in Session 1 support the belief that Experiment 1 subjects either had not

learned the task or were not proficient at the task by the end of Day 1. Further, these data

may underestimate the proficiency deficits of subjects after Day 1 in Experiment 1,

given the additional feedback provided to subjects in Experiment 2, which likely facilitated

task acquisition.

Signaled failure RT differences in Session 1 of Experiment 1 were marginally significant,

but they were not significant in Experiment 2, although means in both experiments were in

the same direction. This may be a reflection of the fact that by Day 2 workload levels were

probably more similar between the two groups than in Experiment 1, as the additional

practice afforded by Day 1 may have reduced the workload levels for Controllers on Day 2.

This contention is based on the premise that workload in Session 1 in Experiment 1 was a

result of both having to learn to detect failures and learn how to control the system, in

addition to manually controlling the system, the latter two tasks not being applicable to

system monitors. However, in Experiment 2, much of the learning had already taken place,

leaving workload differences between the two groups a result only of the need to manually

control the system for Controllers.

Session 2 signaled failure detection performance supported the hypothesis that Controllers

scan the display more effectively than do the Monitors. Two features of the signaled failure

detection performance support this contention. First, and most importantly, is the fact that

there is a significant difference for Controllers between the throttle Visible and throttle

NotVisible condition, yet there is no such difference for Monitors. This is supported by

both the within-subjects’ comparisons and the marginally significant group interaction of

throttle visibility. This finding suggests that the Controllers scanned the peripherally-

located throttle information to facilitate inferential failure detection when the throttle was

present on the display. This scanning of the throttle information necessarily meant a cost in

RT for detecting signaled failures. Thus, in the throttle NotVisible condition, the

Controllers did not have the option of scanning the peripherally located throttle

information, and their signaled failure RT decreased significantly because attention was

focused only in the center of the display. Likewise, the Monitors’ signaled failure detection

performance remained unaffected by the presence or absence of throttle information,

suggesting that there was little, if any, attention allocated to it when it was present.

Surprisingly, there was a significant group effect for signaled failure RT, and a marginally

significant interaction. This finding is contrary to the marginal effect found in Experiment

1 in which Controllers were slower than Yoked monitors. However, in the equivalent

throttle Visible condition in Experiment 2, there was no significant difference. This leads

to the speculation that the Experiment 1 finding was a statistical artifact. However, in the

throttle NotVisible condition, the Controllers were significantly faster than the Monitors,

leading to the significant group effect. This finding is contrary to expectations, as the

hypothesis was that these two groups should have performed similarly on the signaled

failure detection task, as both groups were focused similarly in the center of the display.

Although this finding is problematic for the hypothesis that the controller advantage while

monitoring is due to a higher activation state of the subject’s mental model, there are two

likely explanations which are consistent with the theory. The first is that Controllers

benefit from a more activated mental model of the system and that this activation not only

enhances their ability to perceive, integrate and analyze features of the task with greater

efficiency, but spills over such that even simple stimuli are perceived and responded to

more efficiently. The second possible, but less likely, explanation is that Controllers were

frustrated by the lack of throttle information in the throttle NotVisible condition and were

thus channeling extra effort into the task. While this extra effort did little to enhance

inferential failure detection, it did result in significantly better signaled failure detection

performance.

The pattern of outcomes for inferential failure detection conformed to the experimental

hypotheses. As with signaled failure detection performance, there was a significant effect

of Day in Session 1, with no group interaction. This reflected the fact that both Controllers

and Monitors were still learning the task into the second day. In Day 2 of Session 1, mean

performance for Controllers was better than the Monitors, but this difference was not

significant. This finding reflects a consistent trend in this paradigm that when compared

directly, there is little or no advantage for Controllers. Similar to Experiment 1, Session 1

inferential failure detection performance yielded a marginal speed-accuracy tradeoff, with

accuracy in favor of Controllers. This is likely a result of Controllers taking the time to

manipulate the system in order to “hypothesis test.” While hypothesis testing generated

more accurate performance, there was some cost in RT. However, none of these differences

(reaction time or accuracy) were significant when compared directly. The fact that both

Experiment 1 and Experiment 2 generated the same trade-off in Session 1 implies that this

is a true effect. Further, Wickens and Kessel (1979) found the same trade-off when

comparing monitors and controllers directly. In addition, it is important to note that

Controllers had slightly higher workload than Monitors in Session 1, which seems not to

have an effect on the Controllers’ inferential failure detection performance. This finding

further supports Wickens and Kessel (1980) who found that workload resulting from

manual response organization and execution (e.g., manual tracking) may not compete with

resources allocated to perceptual encoding and memory.

While results from tracking-task experiments suggest that the proprioceptive feedback from

tracking improves performance for Controllers, such direct feedback about system behavior,

specifically system failures, was not available proprioceptively to Controllers in the fuel

management paradigm. However, response-related information might result from the act of

controlling the throttle and manipulating the fuel pumps, thus instantiating the “state” of

the system for Controllers. While this response information is certainly not as diagnostic

about system state as the proprioceptive feedback from tracking, it may serve a similar role

in updating the operator’s mental model of system activity (i.e., the dynamic execution of

the operator’s mental model), thus offsetting any performance deficits due to higher

workload. There are, then, two non-competing explanations for the lack of effect of higher

workload for controllers. Either the resources required to control the system are different

from those required to detect subtle inferential failures, or information obtained or

reinforced from the act of controlling made the task of detecting failures easier, and

therefore more resource efficient even though the resources were the same.

Results from Session 2 supported the experimental hypotheses and successfully replicated

previous findings of Controller superiority in the monitoring task. In the throttle Visible

condition, the Controllers had significantly better inferential failure detection performance

than Monitors. While this finding supports the hypothesis that controlling a system causes

one to be a more effective monitor of inferential failures, it is made more diagnostic by the

fact that no such advantage for Controllers exists in the throttle NotVisible condition.

There was no significant difference between Controllers and Monitors in the throttle

NotVisible condition, and the performance of Controllers declined slightly from the Visible to the

NotVisible condition. There was also a marginal Group by Visibility interaction, reflecting

a worsening of performance for Controllers, but a non-significant improvement in

performance for Monitors from the Visible to NotVisible condition.

While the intent of the throttle visibility manipulation was to affect signaled failure

detection performance, which it did, it was unknown whether the removal of throttle

information would actually hinder performance. In theory, the throttle display provides

information which is useful, but not critical, in diagnosing inferential failures. However,

data from Experiment 1 seemed to indicate that while Controllers were focusing more on

throttle information than Monitors, it did little to help them in detecting inferential failures.

In Experiment 2, however, the throttle mechanism was altered to make it a more valuable

information component in the detection of inferential failures. It appears that this change,

in combination with the increased proficiency gained from the additional day in Session 1,

caused individuals who focused more on the throttle information to have a distinct

advantage in detecting inferential failures.

Importantly, the effect of throttle visibility suggests that scanning the throttle information

was the critical behavior that enhanced Controller performance. This finding is easily

interpreted through Endsley’s (1995) view of the role of the well developed, or highly

activated, mental model of the behavior of a particular system. Endsley (1995, p.43)

suggests that this model, “provides (a) knowledge of the relevant elements of the system

that can be used in directing attention and classifying information in the perception process,

(b) a means of integrating the elements to form an understanding of their meaning, and (c)

a mechanism for projecting future states of the system based on its current state and an

understanding of its dynamics.” Viewed in the context of the current dynamic execution

theory of mental models, these data can be interpreted to suggest that the Controllers,

because of their activated mental model, direct attention to the throttle mechanism, given its

diagnostic importance in detecting failures, and then successfully integrate that perceptual

information with other momentary system attributes to successfully detect failures. When

the throttle information is not visible, this perceptual and computational advantage goes

unused, as is indicated by the non-significant performance difference between Controllers

and Monitors in the throttle NotVisible condition. Although Controllers may have had

some advantage, as seen in the mean difference favoring Controllers, this advantage was

too slight to be significant.

Conclusions and Implications

Experiment 2 successfully supported the hypothesis and replicated findings that controllers

are better at detecting failures when transferring to a monitoring task than subjects who

monitor in both sessions. Further, the hypothesis that controllers may scan the display

more in an attempt to perceive task-relevant stimuli was also supported by the fact that

controllers in this experiment were significantly poorer at detecting signaled, centrally

located failures when relevant system information was present in the periphery of the

display. In addition, it appears that the Controllers not only scanned the display for

information, but they perceived and integrated it more efficiently than monitors and were

thus more effective at detecting inferential failures. The only surprise was that Controllers

were, on the whole, better at detecting signaled failures than were system monitors, suggesting

that there may be some carry-over effect from an activated mental model which is only

indirectly related to the signaled failures.

Several practical and theoretical implications can be drawn from these findings. Most

importantly, the transfer advantage of controllers over monitors was replicated using a more

realistic, cognitively complex dynamic task. The similarity of this paradigm to other

dynamic systems, and the convergence of these data with past findings supports the

contention that experience controlling a system (being “in the loop”) provides advantages to

operators when they must passively monitor the system. These findings also suggest that

controlling the system may make monitors more sensitive to system variability, and

especially to those features of the system which were controlled in the past. This strongly

supports concerns by Moray (1986) that there may be serious consequences when operators

learn to monitor a system without ever having controlled the system. Perhaps, in such

learning environments, the relationships between system variables are simply not

understood or appreciated to the same degree as when one must manually control system

variables. This is especially significant, given the suggestion that pilots transitioning into

highly automated aircraft have little opportunity to acquire or practice manual flying skills

in the aircraft (Orlady & Wheeler, 1989).

I predicted that signaled failure detection in Session 2 would remain unaffected by the

manipulations except for the Controllers in the throttle Visible condition. The data,

however, showed that Controllers across the Visibility conditions were significantly faster

than monitors, with the most significant difference being in the throttle NotVisible

condition. While this can be interpreted in a manner which does not contradict the

hypothesis, it may be viewed as somewhat problematic for a hypothesis that states that the

controller advantage is a result of an activated mental model of the underlying system

behavior. As mentioned previously, however, it is possible that this improved signaled

failure detection performance is a vigilance carry-over effect from an activated mental

model. This would imply that a well-activated mental model not only guides perception to

critical features of that system, but it may also affect perceptual sensitivity to features

generally unrelated to underlying system behavior.

This experimental design does not preclude the possibility that controllers and monitors

develop slightly different mental models of the dynamic system, despite every effort made

in training to prevent it. While the controller’s mental model obviously contained an actual

motor-control component, it was believed that both groups would likely develop the same

underlying operational understanding of the system, and thus the same mental model for

use in inferential failure detection. It is possible, however, that the act of controlling in

Session 1, either through a more active learning experience, or by the reinforcing of certain

system-variable relationships resulting from controlling those variables, may have caused

the development of slightly different mental models. While this does not exclude a mental

model activation theory, it does suggest that Controllers may have a more activated, but

different mental model at their disposal.

Although I believe that this experimental design is highly valid for operational

environments in which training departments have the choice of training future system

monitors with either an automatic-only or a hands-on control methodology, it lacks

ecological validity in the current aviation context. All pilots of highly automated aircraft

have considerable hands-on flying experience, although it may be infrequent in day-to-day

commercial operations as suggested by Orlady and Wheeler (1989). Young (1969) and

Wickens and Kessel (1979) used a repeated measures design so that all subjects both

controlled and monitored. While this design generated failure detection performance

differences between system controllers and monitors, it was impossible to determine the

degree to which a more consistent internal model of the system contributed to the

controllers’ improved performance, as compared to other factors (e.g., proprioceptive

information). Kessel and Wickens (1982) thus employed a between-subjects, transfer of

training design to address the problem.

Given the success of the current experiment in replicating Kessel and Wickens (1982), I

feel that a return to a repeated measures design using this cognitive dynamic task would

offer several distinct advantages for answering additional questions generated by this

experiment. First, a repeated measures design controls for the large between-subjects

variability found both in this experiment and typically in complex vigilance tasks

(Parasuraman, 1986). More importantly, however, it ensures that all subjects develop the

same mental model of the system. While this feature was problematic for Wickens and

Kessel (1979), the fact that proprioceptive feedback is not a direct indication of system

failure in the current paradigm makes this a less pervasive problem. Further, a repeated

measures design has more ecological validity, helping to answer the question of whether

temporarily controlling a cognitive dynamic system subsequently makes one a better

monitor, as indicated by Parasuraman, Mouloua, and Molloy (1996).

Experiment 3 Experimental Hypotheses

Experiment 3 uses the same dynamic fuel management task as in Experiment 2 but with a

repeated measures design. This design was altered so that all subjects were trained in the

controlling task and given sufficient time to become proficient at the task (two days in

addition to the training session, as in Experiment 2). Subjects then monitored and detected

failures for the next four days, except for two trials on either Day 5 or Day 6. The subjects’

failure-detection performance for both failure types was then compared for the two trials

following the controlling re-introduction to the same two trials after continued monitoring

on the non-controller re-introduction day.

As in the previous experiments, it was hypothesized that controlling the system would

improve failure detection performance during subsequent monitoring. Further, this

improvement should be primarily in inferential failure detection, as sensitivity to subtle

system operation is hypothesized to depend more heavily on the level of activation of the operator’s mental

model. Given the superior ecological validity of this design for aviation operations, the

hypothesized performance improvement has strong implications for the value of

controller reintroduction as a means of enhancing monitoring performance.

EXPERIMENT 3

Method

The Methods section for Experiment 3 highlights only differences from Experiment 2.

Subjects

Fifteen right-handed male university students were used in the experiment. Students were

paid a base hourly rate for their participation in the experiment. Additionally, subjects were

given the opportunity to earn a higher hourly rate for good performance.

Task

The task for Experiment 3 was the same as that used for Experiment 2 except for the

following change:

A message box was added to the lower left corner of the display informing subjects of the

participatory mode. The message stated either “Automatic control” or “Manual control,”

and the messages were displayed in different colors to help alert subjects to any change in

participatory mode. In earlier experiments, participatory mode was described in the

training session prior to that day’s task, so no message system was necessary.

Experimental Design Considerations

A completely within-subjects, transfer-of-training design was used to address the large

between-subjects differences typically found in complex vigilance tasks (Parasuraman,

1986) and in the previous two experiments. All subjects learned the controlling task while

detecting both failure types, and proceeded to participate in the controlling mode for the

first two days. Subjects then spent the remaining four days in the monitoring mode, except

for the two 12-minute trials in which they were reintroduced to controlling.

Experiment 3 Pilot Study

Because of the potential confounding effects of trial and day in a within-subjects transfer-of-

training design, a pilot study for Experiment 3 was conducted to determine the best

sequence of conditions. The pilot study used four subjects who controlled the system and

detected failures for the first two days, then transferred to the monitoring mode on Day

3. Subjects monitored the system and detected both failure types on Days 3 through 9.

Results of the Experiment 3 pilot study for trial effects showed a significant difference

between Trials 1 and 4 for inferred failures [F(1,4) = 9.9, p < .05], and a non-significant

difference in the same direction for signaled failures. There was a marginally significant

difference between Trials 3 and 4 for inferred failures [F(1,4) = 6.6, p < .1], but no

difference in means for signaled failures. There was no Trial by Day interaction, indicating

that the observed trial effects were stable across days. Importantly, failure detection

performance was stable in Trials 4 and 5 for both inferred and signaled failures. (See

Figure 7.)

[Figure 7 chart: Signaled Failure RT and Inferred Failure index across Trials 1-5, Days 4-7.]

Figure 7. Inferred and Signaled Failures by Trials, Days 4 - 7.

The pilot study results for Days revealed the typical trend of an improvement from Day 1 to

Day 2 (both controlling days) for both signaled and inferred failures as seen in Experiment

2. Further, on Day 3 (the first monitoring day), inferred failure detection performance

declined, while signaled failure detection performance improved. More importantly, however, both inferred

and signaled failure detection performance were stable by Day 4 and remained so

through Day 7. (See Figure 8.)

[Figure 8 chart: Signaled Failure RT and Inferred Failure index by day, Days 1-9.]

Figure 8. Inferred and Signaled Failures, Days 1 - 9.

Surprisingly, there was an improvement in inferred failure detection performance on Day 8,

without a concurrent improvement for signaled failures. While this initially appears

contrary to the hypothesis that monitors’ performances should decline after continued

monitoring, it more likely demonstrates a problematic feature of the experimental

paradigm. In a true operational environment operators would seldom see the same

“inferential” failure twice. Rather, an understanding of correct system operation would be

the basis for system failure diagnosis. But in this experiment, it appears that sensitivity to

one type of inferred system failure may become a factor after long-duration interaction with

this system. By the beginning of the eighth day, system monitors had already observed 210

inferential failures, not including the training session on the first day. It is therefore quite

likely that the tremendous exposure to the inferential failures used in this paradigm actually

resulted in a subtle but distinct improvement in inferential failure detection performance as

monitors became sensitive to subtle system behavior, in essence “signaling” an inferred

failure.

Another explanation is that this improvement is due to subjects anticipating the end of the

experiment. However, this explanation is discounted because the increase occurred on Day

8, not Day 9, and there was actually a marginal decrease in performance from Day 8 to Day

9. Further, there was no concurrent performance increase for signaled failure detection on

Day 8.

Experiment 3 Design

Results from the pilot study revealed three significant design considerations for Experiment

3. First, given the stability of Trials 4 and 5 for both signaled and inferred failures, it was

determined that these trials would be the best for the between- and within-day comparisons.

This left the first three trials available for the requisite controller re-introduction. Second, due to the stability in both signaled and inferred failure detection performance on Days 4 through 7, a six-day experiment was chosen. This design allowed both sufficient

controlling experience by using Days 1 and 2 as controlling days, and also allowed a

sufficient continuous monitoring period of 2 days prior to controlling re-introduction.

Therefore, subjects controlled on either Day 5 or Day 6 (to counter-balance any potential

effect of Days) on Trials 2 and 3, and then monitored on Trials 4 and 5 (on either Day 5 or

Day 6 depending on the counter-balance). Comparisons were then made between Trials 4 and 5 after controller re-introduction and Trials 4 and 5 after continuous monitoring. (See Figure

9.)

Participatory mode, Experiment 3

           Day 1      Day 2      Day 3        Day 4        Day 5        Day 6
           Training
Trial 1    Control    Control    Auto-Pilot   Auto-Pilot   Auto-Pilot   Auto-Pilot
Trial 2    Control    Control    Auto-Pilot   Auto-Pilot   Control      Auto-Pilot
Trial 3    Control    Control    Auto-Pilot   Auto-Pilot   Control      Auto-Pilot
Trial 4    Control    Control    Auto-Pilot   Auto-Pilot   Auto-Pilot   Auto-Pilot
Trial 5    Control    Control    Auto-Pilot   Auto-Pilot   Auto-Pilot   Auto-Pilot

Figure 9. Experiment 3 experimental design, participatory mode, counter-balanced on

Days 5 and 6. Comparison trials outlined in bold.
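To make the counter-balanced schedule explicit, the short sketch below reproduces the participatory modes of Figure 9 in code. It is purely illustrative rather than a description of the actual experiment software; the group labels ("A" for Day 5 re-introduction, "B" for Day 6) and the function name are hypothetical, and the 30-minute Day 1 training session is omitted.

    DAYS = range(1, 7)
    TRIALS = range(1, 6)

    def participatory_mode(group, day, trial):
        """Return 'Control' or 'Auto-Pilot' for a given group, day, and trial."""
        if day in (1, 2):                        # Days 1 and 2 were controlling days
            return "Control"
        reintro_day = 5 if group == "A" else 6   # counter-balanced re-introduction day
        if day == reintro_day and trial in (2, 3):
            return "Control"                     # controller re-introduction trials
        return "Auto-Pilot"                      # all remaining trials were monitored

    for group in ("A", "B"):
        print("Group", group)
        for day in DAYS:
            print("  Day", day, [participatory_mode(group, day, t) for t in TRIALS])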

Given that the results of the pilot study suggested that continued monitoring performance

might result in an increased sensitivity to the inferential failures in this paradigm (as seen

on Days 8 and 9), Experiment 3 was designed to minimize subjects’ exposure to inferred

failures. In addition to experimental issues, this consideration was warranted by the fact

that lower inferred failure exposure increased external validity. Therefore, although the

Day 1 failure rate remained the same as in Experiment 2 (six signaled/six inferred), the

number of inferred failures experienced by subjects was decreased in the remainder of the

experiment. On Days 2 through 4, subjects were exposed to four inferred failures on one

trial, two inferred failures on two trials, and zero on two trials. Signaled failures remained

constant for all trials to ensure that subjects remained focused on the task even when no

inferred failures were present. (See Figure 10.)

On Days 5 and 6, the controller re-introduction/comparison days, subjects were exposed to

two inferred failures on Trial 1, zero inferred failures on Trials 2 and 3 (the controller re-

introduction trials) and six inferred failures on the comparison trials (Trials 4 and 5).

There were no inferred failures on Trials 2 and 3, as controller reintroduction in operational

environments would likely not expose operators to specific failures. In addition, having

subjects control the system without failures is a stronger test of the hypothesis that

controlling a system activates their dynamic model of the system, thus making them more

sensitive to abnormal system operation. Additionally, exposing subjects to two trials

without inferential failures was consistent with their expectation bias for inferential failures

developed over the previous three days. This avoided implicit suggestion that Days 5 and 6

were any different from the previous days, with the exception of the controlling re-

introduction.

Failure occurrences, Experiment 3

           Day 1          Day 2          Day 3          Day 4          Day 5          Day 6
           Training
Trial 1    6 sig, 6 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 2 inf   6 sig, 2 inf
Trial 2    6 sig, 6 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 0 inf   6 sig, 0 inf
Trial 3    6 sig, 6 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 0 inf   6 sig, 0 inf
Trial 4    6 sig, 6 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 2 inf   6 sig, 6 inf   6 sig, 6 inf
Trial 5    6 sig, 6 inf   6 sig, 4 inf   6 sig, 4 inf   6 sig, 4 inf   6 sig, 6 inf   6 sig, 6 inf

Figure 10. Experiment 3 experimental design, failure occurrences by failure type,

randomized by subject on Days 2 - 4. Comparison trials outlined in bold.
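As a companion to Figure 10, the sketch below builds one subject's failure schedule under the rules just described: six signaled failures on every trial; six inferred failures per trial on Day 1; a randomized set of 0, 0, 2, 2, and 4 inferred failures across the five trials on Days 2 through 4; and the fixed 2/0/0/6/6 pattern on Days 5 and 6. It is an illustration only, with hypothetical function names, and again omits the Day 1 training session.

    import random

    def inferred_counts(day, rng):
        """Inferred-failure counts for Trials 1-5 on a given day."""
        if day == 1:
            return [6, 6, 6, 6, 6]          # Day 1: six inferred failures per trial
        if day in (2, 3, 4):
            counts = [0, 0, 2, 2, 4]        # randomized by subject on Days 2-4
            rng.shuffle(counts)
            return counts
        return [2, 0, 0, 6, 6]              # Days 5-6: comparison trials get six

    def failure_schedule(seed=None):
        rng = random.Random(seed)
        return {day: [{"signaled": 6, "inferred": n} for n in inferred_counts(day, rng)]
                for day in range(1, 7)}

    for day, trials in failure_schedule(seed=1).items():
        print("Day", day, [t["inferred"] for t in trials])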

Training

The training session lasted 30 minutes at the beginning of Day 1 and was identical to that of

Experiment 2. A brief message appeared at the beginning of Day 2 informing subjects that

the number of failures per trial might vary.

At the beginning of Day 3, a message appeared explaining that subjects should monitor the

system and detect both failure types. In addition, it was explained that there would be some

trials in which subjects were required to control the system and detect failures. Subjects

were therefore instructed to check the display at the beginning of each trial to see if the

system was in “Automatic” or “Manual” control mode. In addition, subjects were informed

that they would again see different numbers of failures on each trial for the remainder of the

experiment.

Results

Within-subject comparisons for Trial (4 and 5) within Condition (Post-Control [PC] and

Post-Monitor [PM]), and Trial by Condition were made for both signaled failure RT and the

combined RT and inferential failure error rate measure used in Experiments 1 and 2. An

analysis of variance (ANOVA) was used to test for Trial and Condition main effects and

interactions. Only Trials 4 and 5 on Conditions PC and PM were analyzed for differences.

Although the post-control and post-monitor conditions occurred on both Day 5 and Day 6

because of the counter-balancing, results are reported collapsed across days as condition Post-Control [PC] and condition Post-Monitor [PM] for purposes of clarity. The

raw RT and error rate data for Inferred failures are provided in Appendix D.
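For concreteness, the sketch below illustrates the kind of 2 (Trial: 4 vs. 5) x 2 (Condition: PC vs. PM) within-subjects analysis just described, applied here to signaled failure RT. The long-format data layout, column names, and use of Python's statsmodels and scipy are my assumptions rather than the software actually used; the paired t-test shown for the Trial 5 contrast is equivalent to the reported single-degree-of-freedom F comparison (t squared equals F).

    import pandas as pd
    from scipy import stats
    from statsmodels.stats.anova import AnovaRM

    def analyze_signaled_rt(df: pd.DataFrame):
        """df: long format with columns subject, trial (4 or 5), condition ('PC'/'PM'),
        and signaled_rt (one mean RT per subject per cell)."""
        # Omnibus 2 x 2 repeated-measures ANOVA: Trial, Condition, and their interaction.
        print(AnovaRM(df, depvar="signaled_rt", subject="subject",
                      within=["trial", "condition"]).fit())

        # Planned comparison: PC vs. PM on Trial 5, paired by subject.
        t5 = df[df["trial"] == 5].pivot(index="subject", columns="condition",
                                        values="signaled_rt")
        t, p = stats.ttest_rel(t5["PC"], t5["PM"])
        print(f"Trial 5, PC vs. PM: t({len(t5) - 1}) = {t:.2f}, p = {p:.3f}")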

Signaled Failures

Signaled RT findings supported the experimental hypotheses and, as in Experiment 1,

suggested an inverse relationship between signaled and inferential failure detection

performance (see Figure 12). There was a marginally significant main effect of Condition (failure detection post-controlling [PC] versus post-monitoring [PM]; 846 vs. 719), [F(1,14) = 3.39, p < .1], but no main effect for Trial (Trial 4 vs. Trial 5; 789 vs. 777). Further,

there was no significant Trial by Condition (PC vs. PM) interaction. Planned comparisons

for Trials between and within Conditions (PC vs. PM) for signaled RT yielded a single significant difference. There were no significant differences within Condition PC for Trial (4 vs. 5;

827 vs. 865), nor for Condition PM (750 vs. 688). In addition, there was no significant

difference between Conditions for Trial 4 (827 vs. 750). However, there was a significant

difference between Conditions for Trial 5 (865 vs. 688), [F(1,14) = 4.9, p < .05], as shown

in Figure 11.

[Figure 11 data plot: signaled failure RT (ms) for Trials 4 and 5 in the Post-Control and Post-Monitor conditions.]

Figure 11. Signaled failure detection performance, Trials 4 and 5, Conditions Post-Control

(PC) and Post-Monitor (PM).

Inferential Failures

Inferential failure detection performance supported the experimental hypothesis, but

differences were only marginally significant (p < .1). There was a marginally significant

main effect for Condition (PC vs. PM; 2.16 vs. 2.23), [F(1,14) = 3.39, p < .1], but no main

effect of Trial (4 vs. 5; 2.21 vs. 2.23). There was also a marginally significant Trial by

Condition interaction [F(1,14) = 3.74, p < .1], as post-controllers improved from Trial 4 to

Trial 5, but subjects’ performance in the post-monitoring condition worsened from Trial 4

to Trial 5.

Planned comparisons of inferred failure detection yielded marginally significant differences

in the directions predicted by the hypothesis. There were no significant differences within

Condition PC comparing Trial 4 versus 5 (2.26 vs. 2.06, p = .13), nor in Condition PM

(2.16 vs. 2.31, p = .2), although the trend suggested by these data is interesting and will be

discussed further. There was no significant difference by Condition (PM and PC) for Trial

4 (2.26 vs. 2.16). However, there was a marginally significant difference by Condition for

Trial 5 (2.06 vs. 2.31), [F(1,14) = 3.21, p < .1], as shown in Figure 12. Because of the

marginally significant results using the combined index, RT and error rate were analyzed

separately for Trial 5. While there was no significant difference for RT (PC vs. PM; 8674

vs. 8839), there was a marginally significant difference for error rate (.2 vs. .29), [F(1,14) =

3.55, p < .1].

[Figure 12 data plot: signaled failure RT and inferred failure index for Trials 4 and 5 in the Post-Control and Post-Monitor conditions.]

Figure 12. Inverse relationship of Inferred and Signaled failure detection performance,

Trials 4 and 5, Conditions Post-Control and Post-Monitor.

Day 6 Controllers - Separate Analysis

Because the design was counter-balanced between Days 5 and 6, the possibility existed that

the benefit of controlling on Day 5 would persist beyond the remaining trials of that day and

perhaps into the next day. This effect would thus prevent a controller advantage from

appearing in the data since the comparison trials for Day 5 controllers were Trials 4 and 5

on Day 6. Therefore, a separate analysis was conducted using only Day 6 controllers

(whose comparison trials were from Day 5 and thus untainted by prior controlling).

While these results are potentially confounded by Day, they did support the primary

hypothesis that controlling benefits subsequent monitoring performance, as well as the hypothesis that this benefit may be of considerable duration. There was a significant

difference for condition (PC vs. PM), showing an advantage for post-controllers in the

combined measure for inferential failures (1.8 vs. 2.31), [F(1,7) = 6.08, p < .05].

Additionally, the same comparison for error rate yielded a significant difference (.21 vs. .5), [F(1,7) = 7.98, p < .05]. There were no other significant differences. However, the

signaled and inferred RT means were all in the same direction as those from the analysis using both groups (Day 5 and Day 6 controllers).

Discussion

The results of Experiment 3 support the hypothesis that periodic controlling can improve

subsequent monitoring performance and, importantly, do so within a more externally valid paradigm. Past research comparing the monitoring performance of individuals who

previously monitored and controlled using tracking tasks (Kessel & Wickens, 1982; Young,

1995), as well as Experiment 2 of this dissertation which used a more cognitively-oriented

dynamic task, strongly suggest that controlling a system makes one more sensitive to

dynamic features of the system and thus more sensitive to system failures. While previous

findings are theoretically important, their ecological validity is questionable because in

most operational environments in which operators monitor dynamic systems, their training

(and perhaps some operational experience) includes hands-on system control.

This validity concern is especially acute in aviation environments where operators only

experience extended monitoring requirements after hundreds, if not thousands of hours

controlling the system manually. The design used in Experiment 3, however, demonstrates

that individuals with considerable hands-on manual control experience, then subsequent

monitoring exposure, will benefit from periodic reintroduction to controlling. While the

specific results showing improved inferred failure detection were only marginally significant in Trial 5, this view is supported by Parasuraman, Mouloua, and Molloy (1996), who found

that monitoring performance was superior after a ten minute period in which some of the

previously automated tasks were returned to operator control. Although the anticipated

results were not present in Trial 4, the abrupt transition from controlling to monitoring was likely responsible for the poorer performance on that trial.

Signaled Failures

The differentiation between signaled and inferred failures was originally developed to

distinguish performance improvements which were simply vigilance-like in nature from performance improvements which were dependent on a more proficient or activated

state of system knowledge (see Experiment 1 for details). Since my theory states that

controlling should yield a more “activated” mental model for the operator or, more

specifically, a transition from a static, system-operation-based model to a dynamically activated, current-state-based mental model, it was hypothesized that inferential

failures would be detected more easily when the operator’s mental model was in its dynamic

activation state. However, signaled failures, which required the operator to respond to

simple stimuli, should remain unaffected by mental model activation because effective

analysis of system behavior yields no advantage for the detection of a signaled failure. In

essence, an individual could have no understanding of system operation, yet be perfectly

effective in detecting signaled failures.

Results from Experiment 3 yielded the finding that past controllers, who presumably

benefited from a more activated mental model, were poorer at detecting signaled failures.

The likely explanation is that one aspect of an activated mental model is that subjects spend

more time scanning for vital information on the display (e.g., throttle information in the

periphery of the display), and thus less time focused in the center of the display where the

signaled failures occurred. While this phenomenon was not predicted, it is consistent with

the view that a proficient mental model guides perceptual activity to those features of the

task critical for success (Endsley, 1995).

This same signaled/inferred failure detection trade-off occurred in Experiment 1, and led to

features of Experiment 2 designed to explore this potential effect. The primary

modification to Experiment 2 was the removal of the peripherally located throttle

information on half of the trials. If past controllers’ signaled failure detection disadvantage

was a result of more time spent looking at throttle information, then removal of the throttle

should alter this signaled failure detection deficit. Results from Experiment 2 were

somewhat surprising in that past-controllers were significantly faster than past-monitors

over both conditions (discussed in detail in Discussion 2). However, with past-controllers

there was a significant difference in the signaled failure detection performance between the

throttle Visible and throttle NotVisible conditions, with poorer performance occurring in

the presence of the throttle display. There was no such effect for the past-monitors. This

finding supports the speculation that the presence of the throttle, and consequently the

subject’s attention to it, has a negative effect on signaled failure detection performance.

Further, it supports the contention that the past-controllers paid more attention to the

throttle than did the past-monitors when it was present.

Because the relationship between signaled failure detection performance and the post-control condition differed between Experiments 1 and 2, no prediction was made for this relationship in Experiment 3. It should be noted that this effect would have

little operational significance, since even when significant the response time differences for

signaled failures were relatively small (e.g., 667ms vs. 604ms, from Experiment 2). Rather,

signaled failure detection performance serves as a measure of vigilance and behavioral

differences between groups, not as a measurement of how an operator would respond to a

signaled failure in a real task.

Interestingly, results from Experiment 3 paralleled Experiment 1 in that signaled failure

detection performance in the post-controlling condition was significantly worse than in the

post-monitoring condition. Likewise, inferential failure detection performance was better,

although only marginally significant, for the post-controllers. I believe that there are three

possible explanations for the apparent trade-off between signaled and inferential failure

detection performances. All likely explanations originate from the central point that

controlling makes a subject spend more time focused on throttle information and less time

focused on the center of the display where signaled failures occur. Because subjects do

focus more attention on the throttle, it is presumed that this information, at least in part, is

responsible for their improved inferential failure detection performance.

The first explanation for this trade-off in inferred and signaled failure detection

performance is based on the fact that subjects must allocate more resources to the throttle

portion of the task while controlling because one of their controlling tasks is throttle

management. This task requires that subjects monitor the throttle display quite diligently

(subjects were told on the first day of the experiment that their bonus would be partially

determined by how well they managed the throttle on controlling trials) and to use the joy

stick to manipulate throttle position while controlling during controller re-introduction. It

is possible that the act of performing the task simply reinforces a scanning pattern which

incorporates the throttle. This explanation implies that when subjects return to the

monitoring task, their scanning behavior incorporates the throttle not because of increased

perceptual sensitivity to system attributes nor because a more activated mental model is

driving perceptual efficiency. Rather, it implies that scanning behavior is a result of

unconscious habit which, after controlling the system, happens to result in less scan time in

the center of the display and, therefore, poorer signaled detection performance. This

explanation, however, is not supported by the results. If a habit change was responsible for

the effect, then one would expect the strongest effect to occur directly after the controlling

condition, then weaken as subjects adapted to the monitoring task. However, this effect was

only present in Trial 5 of Experiment 3 when it should have been weakening. The

monitoring trial directly after the controlling re-introduction, Trial 4, showed no difference

between the post-control and post-monitor conditions. Additionally, if the origin of

scanning differences were habitual rather than cognitively driven, it seems unlikely that

there would be a resulting pay-off in inferential failure detection, although this is more

difficult to verify.

The second explanation for the change in scanning behavior is that the relationship

between throttle activity and overall fuel system behavior is strengthened when subjects are

forced to manipulate the throttle level by hand in the manual mode. This explanation

implies that continuous monitoring has the effect of weakening an individual’s

understanding of subtle system features, or perhaps causes individuals to fixate on features

of the task they perceive as having the greatest pay-off in terms of failure detection. This

shift, however, must be involuntary since subjects are instructed to detect failures to the best

of their ability in every condition. There is, therefore, no valid reason for subjects to

intentionally select a less effective strategy. While this shift may be characterized as

“peripheralisation” (Satchell, 1993) or perhaps “automation-induced complacency” (Parasuraman et al., 1993), it is difficult to understand its origin. Perhaps the forced

inactivity of monitoring may induce a cognitive apathy or greater cognitive inactivity

which, unbeknownst to the subjects, has a harmful effect on their inferential failure

detection performance.

The final explanation for this trade-off, and the explanation most consistent with the

hypotheses, is that the reintroduction of controlling both the throttle and fuel pumps has the

effect of strengthening, or re-activating, subjects’ mental models of the system. The effect of

this heightened system understanding, and the resultant increased sensitivity to system

operation, is that subjects pay greater attention to throttle activity and benefit from the

information it provides. Further, this explanation is consistent with Endsley’s (1995) view

that a good mental model guides perceptual activity to cues. On this view, the perceptual

process is generally outside of conscious awareness, and anecdotal evidence from subjects’

post-experiment comments suggests they were unaware that their attention to the throttle

display had changed in any way.

Inferential Failures

Inferential failure detection performance followed the pattern predicted by the hypotheses,

although mean differences were only marginally significant. In the post-controlling

condition, subjects were better at detecting inferential failures than when they had been

continuously monitoring. While the pattern between Trials 4 and 5 (the two comparison

trials) was not predicted, a marginally significant difference between conditions occurred on the fifth trial. Performance differences between the two conditions on Trial 4 were not significant, suggesting that there is a transitional period as subjects transfer from a

controlling to a monitoring mode. This is not surprising given the large differences

between the two tasks, but is likely a factor in need of further study before controlling

becomes an operational method of improving monitoring performance.

More importantly, however, I believe the fact that post-controllers improved from Trial 4 to

5, yet performance decreased in the post-monitoring condition, is operationally quite

significant. This is especially true given that the Experiment 3 pilot study data suggest that

subjects’ performances reached a negative asymptote by Trial 4 and remained poor through

Trial 5. This suggests that periodic controller re-introduction may have the effect of

“resetting the clock” on the deterioration of monitoring performance. The effect presumes

that controlling the system is a considerably different task than monitoring the system, as is

the case in this paradigm (while the objective is the same, the subjects’ activities between

the two tasks are quite different). However, in most operational settings, the difference

between controlling and monitoring is likely as profound as it is in this paradigm.

Further, the “resetting the clock” concept is quite consistent with the theory that controller

re-introduction has the effect of re-activating the operator’s mental model of the system,

thus shifting the state of the operator’s model away from the static state and towards the

dynamic mental model state. If mental models have a state of activation, as proposed in

this theory, then it is likely that this activation decays over time. Viewed

another way, a dynamic mental model can only remain dynamic, and thus provide

perceptual and calculational benefits, for a certain period of time after the features of the

task supporting the dynamic activation cease. While this issue is not directly addressed by

this research (other than at a speculative level), it is another element of the theory which

needs further exploration. Systematic exploration of dynamic mental model decay could

provide critical information for the use of periodic controlling to ward off the negative

effects of continuous monitoring. As discussed in the beginning of this dissertation, it is

extremely unlikely that commercial aviation would ever return to an exclusive manual

control environment. However, controlling might be used for short periods of time to

produce the desired effect. It is critical, therefore, that the exact duration of the positive

effect of controlling be measured in addition to the amount of controlling required to

produce this positive benefit.
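Although no formal decay model is proposed in this dissertation, even a simple placeholder helps frame the two quantities called for above. The sketch below assumes, purely for illustration, that mental-model activation decays exponentially once manual control ends; the time constant, threshold, and function names are all hypothetical and would have to be estimated empirically.

    import math

    def activation(minutes_since_control, tau_minutes):
        """Hypothetical activation level, starting at 1.0 when manual control ends
        and decaying as A(t) = exp(-t / tau)."""
        return math.exp(-minutes_since_control / tau_minutes)

    def max_monitoring_interval(threshold, tau_minutes):
        """Longest uninterrupted monitoring period before activation falls below
        the (hypothetical) threshold: t* = -tau * ln(threshold)."""
        return -tau_minutes * math.log(threshold)

    # Example with made-up numbers: if tau were 60 minutes and the useful threshold
    # were 0.5, controller re-introduction would be needed roughly every 42 minutes.
    print(round(max_monitoring_interval(0.5, 60.0), 1))   # -> 41.6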

It is unfortunate that the inferential failure detection differences were only marginally

significant, but it should be noted that the design of this experiment produced an extremely

conservative test of the hypothesis. Young (1995), using a two dimensional tracking task,

found that subjects who controlled the system in the training portion of the transfer-of-

training experiment without exposure to failures did not show improved failure detection

performance during the monitoring portion of the experiment. This led Young (1995) to

conclude that controlling with failures, rather than just controlling, was responsible for the

improved failure detection performance of past-controllers. This conclusion was

problematic when generalizing to an operational environment because it implied that, in

order to derive benefit from controlling, the specific failures potentially encountered during

monitoring needed to be experienced while controlling. However, in Experiment 3 the

controller re-introduction intentionally excluded inferential failures. This was done to

achieve maximum external validity, and because it would be the most stringent test of the

theory that mental model re-activation was responsible for post-controlling inferential

failure detection performance, and not a result of specific failure type sensitivity.

The amount of time subjects spent controlling during controller re-introduction was

somewhat arbitrary. Trials 2 and 3 were chosen because they avoided the use of Trial 1

(the trial shown to be significantly different from the other trials in the Experiment 3 pilot

study), yet still allowed two trials for comparison at the end of the session. Given the

transitional factors likely affecting the fourth trial, and the fact that some subjects stated in

post-experiment interviews that they were caught off guard by the re-introduction of

controlling, it is likely that some subjects only had solid controlling experience in the third

trial. Each trial lasted 12 minutes, giving subjects 24 minutes or less of controlling

experience depending on how quickly they recognized and adjusted to the change in

participatory mode. Given the relative difficulty of the controlling task, the improvement in

inferential failure detection performance may have been larger if subjects had been afforded

more controlling time. Additionally, given the general downward trend in performance

over trials (as seen in the Experiment 3 pilot study and in Experiment 3 itself), it is likely

that a stronger effect would also have been achieved by increasing the amount of

monitoring time on Days 5 and 6 when controller re-introduction did not take place.

Another potential factor affecting the results was that subjects did not perform the task at

the same time each day. Subjects were required to participate in the experiment for six

consecutive days. However, in order to accommodate student schedules, subjects were

allowed to participate anytime during the day. While it was originally thought that this

latitude would have little or no effect, post-experiment interviews revealed several

comments such as “I sure did a lot better on that experiment in the morning.” While the

variance attributed to this factor is unknown, future multi-day experiments should require

subjects to participate at the same time each day.

Experiment 3 was a successful test of the hypothesis because (a) signaled failure detection performance was affected significantly by the controller re-introduction manipulation in a direction predicted by the mental model hypothesis, (b) inferential failure detection performance was affected at a marginally significant level in a direction supporting the mental model hypothesis, and (c) both effects were strongest in the fifth trial, suggesting some longer-term reversal of the negative effects of continuous monitoring on failure detection

performance. While Experiment 3 generated several important new questions, it also

convincingly demonstrated that controller re-introduction during extended periods of

monitoring can benefit monitoring performance in an ecologically

valid design. Further, when Experiment 3 is considered in the context of this research and

other published works on this topic, it becomes increasingly clear that periodic return to

manual control may be one of the best weapons for fighting the negative effects of

continuous monitoring in operational settings.

Conclusion

This research adds considerable depth to previous studies in this field showing that manual

control of systems produces better system monitors. It lends support to the notion, suggested here and by others (Endsley, 1995; Kessel & Wickens, 1982; Parasuraman et al., 1996), that the construct of a mental model may be the appropriate mechanistic explanation for the

“out-of-the-loop” performance deficits experienced by continuous system monitors. While a

mental model explanation for psychological phenomena may harbor seemingly excessive

complexity compared to simpler knowledge representation or learning approaches, I believe

that in the context of complex cognitive vigilance tasks, it effectively captures the

procedural, semantic, perceptual, and calculational factors affecting individuals’

performances in dynamic task environments. The operator is performing a complex

operation, and it may be a complex explanation which best captures this behavior. The use

of both signaled and inferred failures in this paradigm was novel and effective in

differentiating between vigilance decrements and deficits in the level of activation of the

operator’s mental model. To my knowledge, these experiments are the first to use failures

requiring different levels of cognitive processing for their detection in a single complex

vigilance task.

The first objective of this research was to extend findings that past controllers make better

system monitors by using an experimental paradigm more representative of real-world,

dynamic monitoring tasks. This was accomplished in Experiment 2. However, while

Experiment 2 replicated the basic findings that controllers make better monitors, the use of

the throttle mechanism as both a separate controlling task and as an important information

component of the monitoring task brought to light an important theoretical result. It

appears that operators who are trained by controlling a system develop a higher level of

understanding of system operation, especially in relation to features of that system which

they control. When those subjects are then placed in a monitoring condition, their more

comprehensive understanding allows them greater acuity for important system behaviors,

and they use that information to effectively detect failures. Not only is this finding

theoretically significant in its own right, but it supports contentions by Moray (1986) that

system monitors must be trained in a manual control mode if they are expected to

understand and appreciate system operation at a high level.

The fact that past-controllers appear to scan important features of the display for system

information offers a potential explanation for aircraft accidents in which programming

errors were made on the aircraft’s FMS, yet pilots failed to observe the aircraft’s unintended

behavior. In each of these occurrences, ample evidence was available on the displays, yet

the pilots failed to perceive this information and process it to a level which should have

signaled the existence of a serious problem. In effect, because the pilots had been

monitoring for extended periods of time before these incidents, their perceptual activity was

blinded because system monitoring failed to require perception of these system variables,

and their perceptual cycle became derailed, at least in relation to the primary goal. In

essence, the inactivity of monitoring yielded a weak dynamic execution of their flying

mental model so that access and understanding of subtle system behavior and the

consequent perceptual activity were severely affected. While this is speculation, I believe

that it is the best explanation to date as to why experienced pilots failed to perceive a

multitude of cues that indicated they were in grave danger.

This finding also supports the contention by Smolensky (1993) that the notion of situational

awareness may be related to certain physiological attributes. In fact, this finding strongly

supports the view that ocular movement may be a strong predictor of one’s situational

awareness at a given time. Situational awareness implies a strong dynamic execution of an

operator’s mental model of a task, and thus highly efficient perceptual activity as the

operator updates and integrates information pertaining to the task. I believe that this

perceptual activity should have a strong effect on one’s ocular movement, and is likely to be

highly indicative of one’s level of situational awareness.
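As a concrete, and entirely hypothetical, illustration of how ocular movement might be quantified for this purpose, one simple candidate index is the proportion of dwell time falling inside a display area of interest such as the throttle region. Note that the present experiments inferred scanning differences from detection performance rather than from recorded gaze; the coordinates and values below are invented.

    def dwell_proportion(fixations, aoi):
        """Proportion of total fixation time spent inside an area of interest.
        fixations: iterable of (x, y, duration_ms); aoi: (x_min, y_min, x_max, y_max)."""
        x0, y0, x1, y1 = aoi
        total = sum(d for _, _, d in fixations)
        inside = sum(d for x, y, d in fixations
                     if x0 <= x <= x1 and y0 <= y <= y1)
        return inside / total if total else 0.0

    # Example with made-up fixations and a throttle AOI in the display periphery.
    fixes = [(620, 80, 400), (320, 240, 900), (630, 90, 300)]
    print(dwell_proportion(fixes, aoi=(600, 60, 660, 120)))   # -> 0.4375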

The purpose of Experiment 3 was to use the ecologically valid task of the first two

experiments implemented in an experimental design more representative of real world

operations. The new design allowed all subjects to learn and perfect the task in a

controlling mode, as is the case in the aviation domain. Subjects then monitored for several

days and, after extensive monitoring, were briefly re-introduced to the

controlling task. Even this multi-day design compresses time compared to most operational

settings, but it is more realistic than previous research and external validity is increased by

ensuring that all subjects are trained in the same hands-on manner. Results from this

experiment showed that even a 24 minute controller re-introduction can have a positive

impact on subsequent inferential failure detection performance. Importantly, signaled

failure detection performance was significantly worse after the controlling re-introduction.

The combination of improved inferred and poorer signaled failure detection performance

implies that even a short period of manual control within an extended period of monitoring

can cause subjects to return to a more effective pattern of scanning, while perceiving and

integrating system information more effectively. Further, controlling seemed to have the

effect of “resetting the clock,” so that after nearly 50 minutes of system exposure subjects were performing as if they had just started the task, even though in previous experiments general failure detection performance decreased as a function of time on the task.

The results of Experiment 3 are the strongest evidence yet that periodic controller

reintroduction may be the best tool for airlines and other monitoring-intensive operations to

fight detrimental “out-of-the-loop” performance effects (Endsley, 1995). While the various

perspectives on this problem were outlined in the introduction of this dissertation, they have

generated few concrete solutions. However, the “controlling solution” appears not only to

be effective, but also easily implemented. In fact, the only cost seems to be the slight loss in

operational efficiency which occurs when human operators take control for a period of time.

Of course, before this recommendation is implemented, several important questions need to

be answered. First, an experiment of similar design should be conducted in a highly

realistic full mission simulator using commercial pilots. Second, more experimentation

needs to take place regarding the length of controlling versus the amount of resulting

positive benefit. It seems quite likely that the law of diminishing returns would apply to

controller re-introduction, but that point cannot be determined without further

experimentation. In the same vein, it is also important to know the rate of decay of the

operator’s dynamic execution of their mental model, assuming the pilot hand-flies the

beginning of the mission, and only later is relegated to system monitor. It is also likely that

this factor is highly task dependent, ranging from several minutes to several days. In fact, it

seems likely that the decay rate varies along two dimensions: one being task complexity, the other the extent to which the task is motor versus cognitive.

While these questions will be time consuming to answer, they will certainly yield vital

information for operational settings.

My goal with this research has been to elevate and refine past results showing the benefits

of controlling, and guide this line of research in a direction most beneficial for commercial

aviation and other industrial tasks which utilize continuous monitoring. I believe this line

of work shows that while the “out-of-the-loop” performance problem is both real and

serious, potential solutions are available. Further, unlike most solutions to serious

problems, where the benefit only slightly outweighs the costs, this “controlling solution”

has large benefits with few costs.

REFERENCES

Adams, J. A., Humes, J. M., & Stenson, H. H. (1962). Monitoring of complex


visual displays: III. Effects of repeated sessions on human vigilance. Human Factors, 4,
149-158.

Adams, J. A., Stenson, H. H., & Humes, J. M. (1961). Monitoring of complex


visual displays: II. Effects of visual load and response complexity on human vigilance.
Human Factors, 3, 213-221.

Adams, J. A., Tenney, Y. T., & Pew, R. W. (1995). Situational Awareness and
the Cognitive Management of Complex Systems. Human Factors, 37(1), 85-104.

Billings, C. E. (1991). Human-centered aircraft automation: A concept and


guidelines (NASA Tech. Memorandum 103885). Moffett Field, CA: NASA Ames
Research Center.

Borgman, C. L. (1986). The user’s mental model of an information retrieval


system: An experiment on a prototype on-line catalogue. International Journal of Man
Machine Studies, 24, 47-64.

Comstock, J. R., & Arnegard, R. J. (1992). The multi-attribute task battery for
human operator workload and strategic behavior research (Tech. Memorandum 104174).
Hampton, VA: NASA Langley Research Center.

Confusion over flight mode may have role in A320 crash. (1992, Feb. 3).
Aviation Week & Space Technology, p.29.

Covey, R. R., Mascetti, G. J., Roessler, W. U., & Bowles, R., (1979, December).
Operational energy conservation strategies. Proceedings of the Institute of Electrical and
Electronic Engineers Conference on Decision and Control. Ft. Lauderdale.

Crash triggers review of AMR. (1996, January 1). Aviation Week & Space
Technology, p.30.

Curry, R. E. (1985). The introduction of new cockpit technology: A human


factors study (NASA Tech. Memorandum 86659). Moffett Field, CA: NASA Ames
Research Center.

Curry, R. E., & Ephrath, A. R. (1976). Monitoring and control of unreliable


systems. In T. B. Sheridan and G. Johannsen (Eds.), Monitoring and Supervisory Control.
New York: Plenum.

Endsley, M. R. (1995). Toward a theory of situational awareness in dynamic


systems. Human Factors, 37(1), 32-64.

Endsley, M. R., & Kiris, E. O. (1995). The out-of-the-loop performance problem


and level of control in automation. Human Factors, 37(2), 381-394.

Ephrath, A. R., & Curry, R. E. (1977). Detection by Pilots of System Failures


During Instrument Landings. IEEE Transactions on Systems, Man and Cybernetics, SMC-
7(12), 841-848.

Ephrath, A. R., & Young, L. R. (1981). Monitoring vs. Man-in-the-Loop


Detection of Aircraft Control Failures. In J. Rasmussen and W. B. Rouse (Eds.), Human
Detection and Diagnosis of System Failures. New York: Plenum.

Flach, J. M. (1995). Situation awareness: Proceed with caution. Human Factors, 37(1), 149-157.

Indian A320 crash probe data show crew improperly configured aircraft. (1990,
June 25). Aviation Week & Space Technology, p.84.

Jagacinski, R. J., & Miller, R. A. (1978). Describing the human operator's internal
model of a dynamic system. Human Factors, 20, 425-439.

Johannsen, G., Pfendler, C., & Stein, W. (1976). Human performance and
workload in simulated landing approaches with autopilot-failures. In T. B. Sheridan and
G. Johannsen (Eds.), Monitoring and Supervisory Control. New York: Plenum.

Johnson-Laird, P. N. (1989). Mental Models. In M. I. Posner (Ed.), Foundations


in cognitive science. Cambridge: MIT Press.

Johnson-Laird, P. N. (1983). Mental Models. Cambridge: Cambridge University


Press.

Johnson-Laird, P. N. (1981). Mental models in cognitive science. In D. A.


Norman (Ed.), Perspectives on cognitive science (pp. 147-191). Norwood, NJ: Ablex;
Hillsdale, NJ: Erlbaum.

Jordan, T. C. (1972) Characteristics of visual and proprioceptive response times


in the learning of a motor skill. Quarterly Journal of Experimental Psychology, 24, 536-
543.

Kantowitz, B. H., & Casper, P. A. (1988). Human workload in aviation. In E.


Wiener & D. Nagel (Eds.). Human factors in aviation. San Diego, CA: Academic Press,
Inc. (Chapter 6, pp. 157-188).

Kantowitz, B. H., & Sorkin, R. D. (1983). Human factors: Understanding


people-system relationships. New York: Wiley.

Kessel, C., & Wickens, C. D. (1982). The transfer of failure-detection skills


between monitoring and controlling dynamic systems. Human Factors, 24(1), 49-60

Kieras, D., & Bovair, S. (1984). The role of mental models in learning to operate
a device. Cognitive Science, 8, 255-273.

Mackworth, N. H. (1950). Research on the measurement of human performance.


(Medical Research Council special report series no. 268. London: HM Stationery Office).

Reprinted in H. Sinaiko (Ed.), Selected papers on human factors in the design and use of
control systems. New York: Dover Publications, Inc., 1960.

Moray, N. (1986). Monitoring behavior and supervisory control. In K. Boff


(Ed.), Handbook of perception and human performance (pp. 40/1-40/51). New York:
Wiley.

Nagel, D. C. (1988). Human error in aviation operations. In E. L. Wiener and


D. C. Nagel (Eds.), Human factors in aviation. New York: Academic Press.

Neisser, U. (1976). Cognition and reality. San Francisco: W. H. Freeman and


Co.

Norman, D. A. (1988). The Psychology of Everyday Things. New York: Basic


Books.

Norman, D. A. (1983). Some observations on mental models. In D. Gentner &


A. Stevens (Eds.), Mental models (pp. 7-14). Hillsdale: Erlbaum.

Norman, S., Billings, C. E., Nagel, D., Palmer, E., Wiener, E. L., & Woods, D. D.
(1988). Aircraft automation philosophy: A source document. Flight deck automation:
Promises and realities, [Workshop manual]. NASA Ames Research Center: Moffett Field.

Parasuraman, R. (1986). Vigilance, monitoring, and search. In K. Boff, L.


Kaufman, and J. Thomas (Eds.), Handbook of perception and human performance. (pp.
43/1-43/35). New York: John Wiley & Sons.

Parasuraman, R. (1987). Human-computer monitoring. Human Factors, 29, 695-


706.

Parasuraman, R., Mouloua, M., & Molloy, R. (1996). Effects of Adaptive Task
Allocation on Monitoring of Automated Systems. Human Factors, 38(4), 665-679.

Parasuraman, R., Molloy, R., & Singh, I. L. (1993). Performance consequences of
automation induced "complacency." International Journal of Aviation Psychology, 3(1), 1-
23.

Posner, M. I., Nissen, M. J., & Klein, R. M. (1976). Visual dominance: An


information processing account of its origins and significance. Psychological Review, 83(2), 157-170.

Sarter, N. B., & Woods, D. D. (1995). How in the world did we ever get into that
mode? Mode error and Awareness in supervisory control. Human Factors, 37(1), 5-19.

Sarter, N. B. & Woods, D. D. (1994). Pilot interaction with automation II: An


experimental study of pilots' model and awareness of the flight management system.
International Journal of Aviation Psychology, 4(1), 1-28.

Sarter, N. B., & Woods, D. D. (1992). Pilot interaction with cockpit automation:
Operational experiences with the flight management system. The International Journal of
Aviation Psychology, 2(1), 303-322.

Sarter, N. B., & Woods, D. D. (1991). Situation awareness: A critical but ill-
defined Phenomenon. The International Journal of Aviation Psychology, 1(1), 45-57.

Satchell, P. M. (1993). Cockpit monitoring and alerting system. Ashgate


Publishing: Aldershot, England.

Sekigawa, E., & Mecham, M. (1996, July 29). Pilots, A300 systems cited in
Nagoya Crash. Aviation Week & Space Technology, 36-37.

Singh, I. L., Molloy, R., & Parasuraman, R. (1993). Individual differences in


monitoring failures in automation. Journal of General Psychology, 120(3), 257-276.

Smolensky, M. W. (1993). Toward the physiological measurement of situational


awareness: The case for eye movement measurements. Proceedings of Human Factors and
Ergonomics Society 37th Annual Meeting, 41.

Thackray, R. I., & Touchstone, R. M. (1989). Detection efficiency on an air
traffic control monitoring task with and without computer aiding. Aviation, Space and
Environmental Medicine, 60, 744-748.

Van Cott, H. P., Wiener, E. L., Wickens, C. D., Blackman, H. S., & Sheridan, T.
B. (1996, October). Smart automation enhances safety: A motion for debate. Ergonomics
in Design, 4(4), 19-23.

Wickens, C. D. (1992). Engineering Psychology and Human Performance, New


York, NY: Harper-Collins.

Wickens, C. D., & Kessel, C. (1979). The effects of participatory mode and task
workload on the detection of dynamic system failures. IEEE Transactions on Systems,
Man, and Cybernetics, SMC-9(1), 24-34.

Wickens, C. D., & Kessel, C. (1980). Processing resource demands of failure


detection in dynamic systems. Journal of Experimental Psychology: Human Perception
and Performance, 6(3), 564-577.

Wickens, C. D., & Kessel C. (1981). Failure detection in dynamic systems. In J.


Rasmussen and W. B. Rouse (Eds.), Human detection and diagnosis of system failures.
New York: Plenum.

Wiener, E. L. (1993). Life in the second decade of the glass cockpit. Proceedings
of the Seventh International Symposium on Aviation Psychology, 1-11.

Wiener, E. L. (1989). Human factors of advanced technology (“glass cockpit”)


transport aircraft (NASA Tech. Memorandum 177528). Moffett Field, CA: NASA Ames
Research Center.

Wiener, E. L. (1988). Cockpit automation. In E. Wiener & D. Nagel (Eds.),


Human factors in aviation. San Diego, CA: Academic Press.

Wiener, E. L. (1985). Cockpit automation: In need of a philosophy (SAE Tech.
paper 851956). Washington D. C.

Wiener, E. L. (1984). Vigilance and inspection. In J. Warm (Ed.), Sustained


attention in human performance. John Wiley & Sons: New York.

Wiener, E. L., & Curry, R. E. (1980). Flight deck automation: Promises and
problems. Ergonomics, 23(10), 995-1011.

Williams, M. D., Hollan, J. D., & Stevens, A. L. (1983). Human reasoning about
a simple physical system. In D. Gentner & A. Stevens (Eds.), Mental models (pp. 131-
153). Hillsdale: Erlbaum.

Wilson, J. R., & Rutherford, A. (1989). Mental models in human factors.


Human Factors, 31(16), 995-1011.

Woodson, W. E. (1981). Human factors design handbook. New York: McGraw-


Hill.

Young, G. E. (1995). The Impact of Trial Length and Mode Experience on


Failure-Detection Performance in Monitored and Controlled Dynamic Tasks. Proceedings
of the Eighth International Symposium on Aviation Psychology, 1031-1036.

Young, L. R. (1969). On adaptive manual control. IEEE Transactions on Man-Machine Systems, MMS-10, 292-331.

APPENDIX A: Experimental task.

APPENDIX B: Experiment 1 Inferred failure RT and error rate.

Group                   Day 1 (Session 1)   Day 2 (Session 2, monitoring)
Controllers             Control             Auto-pilot; Yoked
"Auto-pilot" monitors   Auto-Pilot          Auto-pilot; Yoked
"Yoked" monitors        Yoked               Auto-pilot; Yoked

Experiment 1 participatory modes.

Group                   Day 1 (Session 1)   Day 2: Auto-pilot   Day 2: Yoked
Controllers             .51/8171            .53/8963            .49/8660
"Auto-pilot" monitors   .77/10506           .7/8116             .62/10401
"Yoked" monitors        .74/8389            .76/8911            .58/9026

Experiment 1 Inferred failure detection performance, error rate and reaction times. (Error rate/RT).

APPENDIX C: Experiment 2 Inferred failure RT and error rate.

Group                   Day 1 (Session 1)   Day 2 (Session 1)   Day 3 (Session 2, monitoring)
Controllers             Control             Control             Throt Vis; Throt NotVis
"Yoked" monitors        Monitor             Monitor             Throt Vis; Throt NotVis

Experiment 2 participatory modes.

Group                   Day 1        Day 2        Day 3: Throt Vis   Day 3: Throt NotVis
Controllers             .37/8467     .26/8310     .21/7315           .24/7398
"Yoked" monitors        .44/8401     .33/8098     .3/7627            .25/7845

Experiment 2 Inferred failure detection performance, error rate and reaction times. (Error rate/RT).

APPENDIX D: Experiment 3 Inferred failure RT and error rate.

           Day 1      Day 2      Day 3        Day 4        Day 5        Day 6
           Training
Trial 1    Control    Control    Auto-Pilot   Auto-Pilot   Auto-Pilot   Auto-Pilot
Trial 2    Control    Control    Auto-Pilot   Auto-Pilot   Control      Auto-Pilot
Trial 3    Control    Control    Auto-Pilot   Auto-Pilot   Control      Auto-Pilot
Trial 4    Control    Control    Auto-Pilot   Auto-Pilot   Auto-Pilot   Auto-Pilot
Trial 5    Control    Control    Auto-Pilot   Auto-Pilot   Auto-Pilot   Auto-Pilot

Experiment 3 participatory modes, comparison trials in bold.

           Day 1      Day 2      Day 3        Day 4        Day 5        Day 6
           Training
Trial 1    Control    Control    Auto-Pilot   Auto-Pilot   Auto-Pilot   Auto-Pilot
Trial 2    Control    Control    Auto-Pilot   Auto-Pilot   Control      Auto-Pilot
Trial 3    Control    Control    Auto-Pilot   Auto-Pilot   Control      Auto-Pilot
Trial 4    Control    Control    Auto-Pilot   Auto-Pilot   .25/8902     .24/8277
Trial 5    Control    Control    Auto-Pilot   Auto-Pilot   .2/8674      .29/8839

Experiment 3 Inferred failure detection performance, error rate and reaction times. (Error rate/RT).

