
THE HUMAN-COMPUTER INTERACTION HANDBOOK
Fundamentals, Evolving Technologies and Emerging Applications

Julie A. Jacko (Georgia Institute of Technology) and Andrew Sears (UMBC), Editors

Lawrence Erlbaum Associates, Publishers
Mahwah, New Jersey / London, 2003
56
USER-BASED EVALUATIONS

Joseph S. Dumas
Oracle Corporation

Introduction
User-Administered Questionnaires
Off-the-Shelf Questionnaires
Observing Users
Empirical Usability Testing
The Focus Is on Usability
The Participants Are End Users or Potential End Users
There Is a Product or System to Evaluate
The Participants Think Aloud As They Perform Tasks
The Participants Are Observed, and Data Are Recorded and Analyzed
Measures and Data Analysis
Communicating Test Results
Variations on the Essentials
Measuring and Comparing Usability
Comparing the Usability of Products
Baseline Usability Tests
Allowing Free Exploration
Challenges to the Validity of Usability Testing
How Do We Evaluate Usability Testing?
Why Can't We Map Usability Measures to User Interface Components?
Are We Ignoring the Operational Environment?
Why Don't Usability Specialists See the Same Usability Problems?
Additional Issues
How Do We Evaluate Ease of Use?
How Does Usability Testing Compare With Other Evaluation Methods?
Is It Time to Standardize Methods?
Are There Ethical Issues in User Testing?
Is Testing Web-Based Products Different?
The Future of Usability Testing
Which User-Based Method to Use?
References


INTRODUCTION
Over the past 20 years, there has been a revolution in the way
products, especially high-tech products, are developed. It is no
longer accepted practice to wait until the end of development
to evaluate a product. That revolution applies to evaluating usability. As the other chapters in this handbook show, evaluation
and design now are integrated. Prototyping software and the
acceptance of paper prototyping make it possible to evaluate
designs as early concepts, then throughout the detailed design
phases. User participation is no longer postponed until just before the product is in its final form. Early user involvement has
blurred the distinction between design and evaluation. Brief usability tests are often part of participatory design sessions, and
users are sometimes asked to participate in early user interface
design walkthroughs. Although the focus of this chapter is on
user-based evaluation methods, I concede that the boundary
between design methods and evaluation methods grows less
distinct with time.
In this chapter, I focus on user-based evaluations, which are
evaluations in which users directly participate. But the boundary between user-based and other methods is also becoming less
distinct. Occasionally, usability inspection methods and user-based methods merge, such as in the pluralistic walkthrough
(Bias, 1994). In this chapter, I maintain the somewhat artificial distinction between user-based and other evaluation methods to treat user-based evaluations thoroughly. I describe three
user-based methods: user-administered questionnaires, observing users, and empirical usability testing. In the final section of
the chapter, I describe when to use each method.

USER-ADMINISTERED QUESTIONNAIRES
A questionnaire can be used as a stand-alone measure of usability, or it can be used along with other measures. For example, a
questionnaire can be used at the end of a usability test to measure the subjective reactions of the participant to the product
tested, or it can be used as a stand-alone usability measure of the
product. Over the past 20 years, there have been questionnaires
that
Measure attitudes toward individual products
Break attitudes down into several smaller components, such
as ease of learning
Measure just one aspect of usability (Spenkelink, Beuijen, &
Brok, 1993)
Measure attitudes that are restricted to a particular technology,
such as computer software
Measure more general attitudes toward technology or computers (Igbaria & Parasuraman, 1991)
Are filled out after using a product only once (Doll &
Torkzadeh, 1988)
Assume repeated use of a product
Require a psychometrician for interpretation of results
(Kirakowski, 1996)

Come with published validation studies


Provide comparison norms to which one can compare results
Throughout this history, there have been two objectives
for questionnaires developed to measure usability: (a) create
a short questionnaire to measure users' subjective evaluation
of a product, usually as part of another evaluation method, and
(b) create a questionnaire to provide an absolute measure of the
subjective usability of a product. This second objective parallels
the effort in usability testing to find an absolute measure of usability, that is, a numerical measure of the usability of a product
that is independent of its relationship to any other product.
Creating a valid and reliable questionnaire to evaluate usability takes considerable effort and specialized skills, skills in
which most usability professionals don't receive training. The
steps involved in creating an effective questionnaire include the
following:
Create a number of questions or ratings that appear to tap
attitudes or opinions that you want to measure. For example,
the questions might focus on a product's overall ease of use.
At the beginning of the process, the more questions you can
create, the better.
Use item analysis techniques to eliminate the poor questions and keep the effective ones. For example, if you asked a
sample of users to use a product and then answer the questions,
you could compute the correlation between each question and
the total score of all of the questions. You would eliminate questions with low correlations. You would also eliminate questions
with small variances because nearly all of the respondents are
selecting the same rating value or answer. You would also look
for high correlations between two questions because this indicates that the questions may be measuring the same thing. You
could then eliminate one of the two.
Assess the reliability of the questionnaire. For example,
you could measure test-retest reliability by administering the
questionnaire twice to the same respondents, but far enough
apart in time that respondents would be unlikely to remember
their answers from the first time. You could also measure split
half reliability by randomly assigning each question to one of two
sets of questions, then administering both sets and computing
the correlation between them (Gage & Berliner, 1991).
Assess the validity of the questionnaire. Validity is the most
difficult aspect to measure but is an essential characteristic of
a questionnaire (Chignell, 1990). A questionnaire is valid when
it measures what it is supposed to measure, so a questionnaire
created to measure the usability of a product should do just that.
Demonstrating that it is valid takes some ingenuity. For example,
if the questionnaire is applied to two products that are known
to differ on usability, the test scores should reflect that difference. Or test scores from users should correlate with usability
judgments of experts about a product. If the correlations are
low, either the test is not valid or the users and the experts are
not using the same process. Finally, if the test is valid, it should
correlate highly with questionnaires with known validity, such
as the usability questionnaires discussed in the following subsections.
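To make the item analysis and reliability steps above concrete, here is a minimal sketch in Python. The response data, the 5-point rating format, and the use of an odd-even split (rather than the random assignment described above) are illustrative assumptions, as is the Spearman-Brown correction commonly applied to a split-half correlation; none of these details come from the chapter itself.

```python
import numpy as np

# Hypothetical ratings: rows are respondents, columns are candidate
# questions on a 5-point scale (the values are invented for illustration).
responses = np.array([
    [5, 4, 4, 2, 5],
    [4, 4, 3, 3, 4],
    [2, 1, 2, 4, 1],
    [3, 2, 3, 3, 2],
    [5, 5, 4, 1, 5],
    [1, 2, 1, 5, 2],
])
totals = responses.sum(axis=1)

# Item analysis: flag questions with low item-total correlations or with
# variances so small that nearly everyone chose the same answer.
for item in range(responses.shape[1]):
    r = np.corrcoef(responses[:, item], totals)[0, 1]
    var = responses[:, item].var(ddof=1)
    print(f"Q{item + 1}: item-total r = {r:.2f}, variance = {var:.2f}")

# Split-half reliability: correlate two half-scores (odd vs. even items here),
# then apply the Spearman-Brown correction to estimate full-test reliability.
odd = responses[:, 0::2].sum(axis=1)
even = responses[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
print(f"split-half r = {r_half:.2f}, corrected = {2 * r_half / (1 + r_half):.2f}")
```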


Off-the-Shelf Questionnaires
Because an effective questionnaire takes time and special skills
to develop, usability specialists have been interested in using
off-the-shelf questionnaires that they can borrow or purchase.
The advantages of using a professionally developed questionnaire are substantial. These questionnaires usually have been
developed by measurement specialists who assess the validity
and reliability of the instrument as well as the contribution of
each question.
Historically, there have been two types of questionnaires developed: (a) short questionnaires that can be used to obtain a
quick measure of users' subjective reactions, usually to a product that they have just used for the first time, and (b) longer
questionnaires that can be used alone as an evaluation method
and that may be broken out into more specific subscales.
Short Questionnaires. There have been a number of published short questionnaires. A three-item questionnaire was developed by Lewis (1991). The three questions measure the
users' judgment of how easily and quickly tasks were completed.
The System Usability Scale (SUS) has 10 questions (Brooke,
1996). It can be used as a stand-alone evaluation or as part of a
user test. It can be applied to any product, not just software. It
was created by a group of professionals then working at Digital Equipment Corporation. The 10 SUS questions have a Likert scale format: a statement followed by a five-level agreement scale. For example,


I think that I would like to use this system frequently.

Strongly disagree   1   2   3   4   5   Strongly agree

Brooke (1996) described the scale and the scoring system, which yields a single, 100-point scale.
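The chapter does not reproduce the scoring rules, but Brooke's (1996) procedure is short enough to sketch in Python; the example responses below are invented.

```python
def sus_score(responses):
    """Convert the ten SUS responses (1-5 agreement ratings, in questionnaire
    order) into the single 0-100 score described by Brooke (1996)."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    contributions = []
    for i, r in enumerate(responses, start=1):
        # Odd-numbered items are positively worded, even-numbered negatively,
        # so each item contributes a value between 0 and 4.
        contributions.append(r - 1 if i % 2 == 1 else 5 - r)
    return sum(contributions) * 2.5

# A fairly favorable (and entirely hypothetical) set of responses:
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # -> 85.0
```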
A somewhat longer questionnaire is the Computer User Satisfaction Inventory (CUSI). It was developed to measure attitudes
toward software applications (Kirakowski & Corbett, 1988). It
has 22 questions that break into two subscales: affect (the degree to which respondents like the software) and competence
(the degree to which respondents feel they can complete tasks
with the product).
Stand-Alone Questionnaires. These questionnaires were
developed to measure usability as a stand-alone method. They
have many questions and attempt to break users' attitudes into
a number of subscales. The Questionnaire for User Interaction
Satisfaction (QUIS) was developed at the Human-Computer Interaction Lab (HCIL) at the University of Maryland at College
Park (Chin, Diehl, & Norman, 1988). QUIS was designed to assess users' subjective satisfaction with several aspects of the
human-computer interface. It has been used by many evaluators over the past 10 years, in part because of its inclusion in Shneiderman's (1997) editions. It consists of a set of
general questions, which provide an overall assessment of a
product, and a set of detailed questions about interface components. Version 7.0 of the questionnaire contains a set of demographic questions, a measure of overall system satisfaction, and hierarchically organized measures of 11 specific interface factors: screen factors, terminology and system feedback, learning factors, system capabilities, technical manuals, online tutorials, multimedia, voice recognition, virtual environments, Internet access, and software installation.
Because QUIS's factors are not always relevant to every product, practitioners often select a subset of the questions to use or use only the general questions. There is a long form of QUIS (71 questions) and a short form (26 questions). Each question uses a 9-point rating scale, with the end points labeled with adjectives. For example,

Characters on the screen are:
Hard to read   1   2   3   4   5   6   7   8   9   Easy to read

There is a Web site for QUIS (www.lap.umd.edu/QUIS/index.html). Licenses for use are available for a few hundred dollars. The site also contains references to evaluations that have used QUIS.
The Software Usability Measurement Inventory (SUMI) was developed to evaluate software only (Kirakowski, 1996). It is a well-constructed instrument that breaks the answers into six subscales: global, efficiency, affect, helpfulness, control, and learnability. The Global subscale is similar to QUIS's general questions.
The SUMI questionnaire consists of 50 statements to which users reply that they either agree, are undecided, or disagree. For example:
This software responds too slowly to inputs.


The instructions and prompts are helpful.
The way that system information is presented is clear and
understandable.
I would not like to use this software every day.
Despite its length, SUMI can be completed in about 5 minutes.
It does assume that the respondents have had several sessions
working with the software. SUMI has been applied not only to
new software under development, but also to compare software
products and to establish a usability baseline. SUMI has been
used in development environments to set quantitative goals,
track achievement of goals during product development, and
highlight good and bad aspects of a product.
SUMI's strengths come from its thorough development. Its
validity and reliability have been established. In addition, its
developers have created norms for the subscales so that you
can compare your software against similar products. For example, you could show that the product you are evaluating scored
higher than similar products on all of the subscales. The norms
come from several thousand respondents.
The questionnaire comes with a manual for scoring the questions and using the norms. The developers recommend that
the test be scored by a trained psychometrician. For a fee, the
developer will do the scoring and the comparison with norms.
The license comes with 50 questionnaires in the language of
your choice, a manual, and software for scoring the results and creating reports. The Web site for SUMI is http://www.ucc.ie/hfrg/questionnaires/sumi/index.html.
Questionnaires can play an important role in a toolkit of usability evaluation methods. It is difficult to create a good one,
but there are several that have been well constructed and extensively used. The short ones can be used as part of other evaluation methods, and for most usability specialists, using them is
preferable to creating their own. The longer ones can be used
to establish a usability baseline and to track progress over time.
Even the longest questionnaires can be completed in 10 minutes or less. Whether any of these questionnaires can provide
an absolute measure of usability remains to be demonstrated.

OBSERVING USERS
Although observing users is a component of many evaluation
methods, such as watching users through a one-way mirror during a usability test, this section focuses on observation as a standalone evaluation method. Some products can only be evaluated
in their use environment, where the most an evaluator can do
is watch the participants. Indeed, one could evaluate any product by observing its use and recording what happens. For example, if you were evaluating new software for stock trading,
you could implement it and then watch trading activity as it
occurs.
Unfortunately, observation has several limitations when used
alone (Baber & Stanton, 1996), including the following:
It is difficult to infer causality while observing any behavior. Because the observer is not manipulating the events that
occur, it is not always clear what caused a behavior.
The observer is unable to control when events occur. Hence,
important events may never occur while the observer is
watching. A corollary to this limitation is that it may take a
long time to observe what you are looking for.
Participants change their behavior when they know they are
being observed. This problem is not unique to observation;
in fact it is a problem with any user-based evaluation method.
Observers often see what they want to see, which is a direct
challenge to the validity of observation.
Baber and Stanton provide guidelines for using observation
as an evaluation method.
A method related to both observation and user testing is
private camera conversation (DeVries, Hartevelt, & Oosterholt,
1996). Its advocates claim that participants enjoy this method
and that it yields a great deal of useful data. The method requires
only a private room and a video camera with a microphone. It
can be implemented in a closed booth at a professional meeting,
for example. The participant is given a product and asked to
go into the room and, when ready, turn on the camera and
talk. The instructions on what to talk about are quite general,
such as asking them to talk about what they like and dislike
about the product. The sessions are self-paced but quite short
(5-10 minutes). As with usability testing, the richness of the verbal protocol is enhanced when two or more people who know each other participate together.
The product of the sessions is a videotape that must be
watched and analyzed. Because the participants are allowed to
be creative and do not have to follow a session protocol, it is
difficult to evaluate the usability of a product with this method.
A related method has been described by Bauersfeld and
Halgren (1996). The rationale behind this passive video observation is the assumption that a video camera can be less intrusive
than a usability specialist but more vigilant. In this method, a
video camera is set up in the user's work environment. There is
a second camera or a scan converter that shows what is on the
user's screen or desk surface. The two images are mixed and
recorded. Participants are told to ignore the cameras as much as
possible and to work as they normally would. Participants are
shown how to turn on the equipment and told to do so whenever they work. This method can be used during any stage of
product development and not just for evaluation.
Although passive video capture is done without a usability
specialist present, we still don't know whether participants act
differently because they know they are being taped. In addition,
the data must be extracted from the videotapes, which takes as
much time to watch as if the participant were being observed
directly. Still, this method can be used in situations in which an
observer can't be present when users are working.
EMPIRICAL USABILITY TESTING
Usability testing began in the early 1980s at a time when computer software was beginning to reach a wider audience than
just computing professionals. The explosion of end user computing was made possible by new hardware and software in
the form of both the mini- and microcomputer and expansion
of communications technology, which moved computing from
the isolated computer room to the desktop. The advent of the
cathode ray tube (CRT) and communications technology made
it possible to interact directly with the computer in real time.
The 1982 conference, Human Factors in Computer Systems, held at Gaithersburg, Maryland, brought together for
the first time professionals interested in studying and understanding human-computer interaction. Subsequent meetings of
this group became known as the Computer-Human Interaction
(CHI) Conference. At that first meeting, there was a session
on evaluating text editors that described early usability tests
(Ledgard, 1982). The reports of these studies were written in
the style of experimental psychology reports, including sections
titled "Experimental Design" and "Data Analysis," in which the
computation of inferential statistics was described.
But the reliance on psychological research experiments as
a model for usability testing was challenged early. Young and
Barnard (1987) proposed the concept of scenarios instead of
experiments, and 2 years later, CHI Conference writers were
discussing issues such as "The Role of Laboratory Experiments
in HCI: Help, Hindrance or Ho-Hum?" (Wolf, 1989).
The first books on HCI began appearing at this time. Perhaps
the most influential book on usability, Shneiderman's (1987)
first edition of Designing the User Interface, did not have a section or index item for usability testing but did have one on
quantitative evaluations. In that section, Shneiderman wrote the
following:
Scientific and engineering progress is often stimulated by improved
techniques for precise measurement. Rapid progress in interactive systems design will occur as soon as researchers and practitioners evolve
suitable human performance measures and techniques.... Academic
and industrial researchers are discovering that the power of traditional
scientific methods can be fruitfully employed in studying interactive
systems. (p. 411)

In the 1992 edition, there again was no entry in the index for usability testing, but there is one for usability laboratories.
Shneiderman described usability tests but called them "pilot
tests." These tests "can be run to compare design alternatives,
to contrast the new system with current manual procedures, or
to evaluate competitive products" (p. 479).
In the 1997 edition, there is a chapter section on usability
testing and laboratories. Shneiderman wrote:
Usability-laboratory advocates split from their academic roots as these
practitioners developed innovative approaches that were influenced
by advertising and market research. While academics were developing
controlled experiments to test hypotheses and support theories, practitioners developed usability-testing methods to refine user interfaces
rapidly. (p. 128)

This brief history shows that usability testing has been an established evaluation method for only about 10 years. The research studies by Virzi (1990, 1992) on the relatively small number of participants needed in a usability test gave legitimacy to
the notion that a usability test could identify usability problems
quickly. Both of the book-length descriptions of usability testing
(Dumas & Redish, 1993; Rubin, 1994) explicitly presented usability testing as a method separate from psychological research.
Yet, as discussed later in this chapter, comparisons between usability testing and research continue. The remaining sections on
usability testing cover usability testing basics, important variations on the essentials, challenges to the validity of user testing,
and additional issues.
Valid usability tests have the following six characteristics.
The focus is on usability.
The participants are end users or potential end users.
There is some artifact to evaluate, such as a product design, a
system, or a prototype of either.
The participants think aloud as they perform tasks.
The data are recorded and analyzed.
The results of the test are communicated to appropriate
audiences.

The Focus Is on Usability


It may seem like an obvious point that a usability test should be
about usability, but sometimes people try to use a test for other,
inappropriate purposes or call other methods usability tests.
Perhaps the most common mismatch is between usability and marketing and promotional issues, such as adding a question to a posttest questionnaire asking participants if they would buy
the product they just used. If the purpose of the question is to
provide an opportunity for the participant to talk about his or
her reactions to the test session, the question is appropriate.
But if the question is added to see if customers would buy the
product, the question is not appropriate. A six-participant usability test is not an appropriate method for estimating sales or
market share. Obviously, a company would not base its sales
projections on the results of such a question, but people who
read the test report may draw inappropriate conclusions, for
example, when the product has several severe usability problems, but five of the six participants say that they would buy it.
The participants' answers could provide an excuse for ignoring
the usability problems. It is best not to include such questions or
related ones about whether customers would use the manual.
The other common misconception about the purpose of a
test is to view it as a research experiment. The fact is, a usability
test looks like research. It often is done in a "lab," and watching
participants think out loud fits a stereotype some people have
about what a research study looks like. But a usability test is not
a research study (Dumas, 1999).
A Usability Test Is Not a Focus Group. Usability testing
sometimes is mistaken for a focus group, perhaps the most used
and abused empirical method of all time. People new to user-based evaluation jump to the conclusion that talking with users
during a test is like talking with participants in a focus group.
But a usability test is not a group technique, although two participants are sometimes paired, and a focus group is not a usability
test unless it contains the six essential components of a test.
The two components of a usability test that are most often missing from a focus group are (a) a primary emphasis on usability and (b) having the participants perform tasks during the session.
The most common objective for a usability test is the diagnosis of usability problems. When testers use the term usability
test with no qualifier, most often they are referring to a diagnostic test. When the test has another purpose, it has a qualifier
such as comparison or baseline.
When Informal Really Means Invalid. One of the difficulties in discussing usability testing is finding a way to describe a
test that is somewhat different from a complete diagnostic usability test. A word that is often used to qualify a test is informal,
but it is difficult to know what informal really means. Thomas
(1996) described a method, called "quick and dirty" and
"informal," in which the participants are not intended users of
the product and in which time and other measures of efficiency
are not recorded. Such a test may be informal in some sense of
that word, but it is certainly invalid and should not be called a
usability test. It is missing one of the essentials: potential users.
It is not an informal usability test because it is not a usability test
at all. Still we need words to describe diagnostic tests that differ
from each other in important ways. In addition, tests that are
performed quickly and with minimal resources are best called
"quick and clean" rather than "informal" or "quick and dirty"
(Wichansky, 2000).


The Participants Are End Users or Potential End Users

that 80% of the problems are uncovered with about five participants and 90% with about 10 continue to be confirmed (Law &
Vanderheiden, 2000).
What theses studies mean for practitioners is that, given a
sample of tasks and a sample of participants, just about all of
the problems testers will find appear with the first 5 to 10 participants. This research does not mean that all of the possible
problems with a product appear with 5 or 10 participants, but
most of the problems that are going to show up with one sample
of tasks and one group of participants will occur early.
There are some studies that do not support the finding that
small samples quickly converge on the same problems. Lewis
(1994) found that for a very large product, a suite of office productivity tools, 5 to 10 participants was not enough to find nearly
all of the problems. The studies by Molich et al. (1998, 2001)
also do not favor convergence on a common set of problems.
As I discuss later, the issue of how well usability testing uncovers the most severe usability problems is clouded by the
unreliability of severity judgments.

A valid usability test must test people who are part of the target market for the product. Testing with other populations may be useful, that is, it may find usability problems. But the results cannot be generalized to the relevant population, the people for whom the product is intended.
The key to finding people who are potential candidates for
the test is a user profile (Branaghan, 1997). In developing a profile of users, testers want to capture two types of characteristics:
those that the users share and those that might make a difference
among users. For example, in a test of an upgrade to a design for
a cellular phone, participants could be people who now own
a cell phone or who would consider buying one. Of the people who own a phone, you may want to include people who
owned the previous version of the manufacturer's phone and
people who own other manufacturers' phones. These characteristics build a user profile. It is from that profile that you create
a recruiting screener to select the participants.
A common issue at this stage of planning is that there are
more relevant groups to test than there are resources to test
them. This situation forces the test team to decide on which
group or groups to focus. This decision should be based on the
product management's priorities not on how easy it might be
to recruit participants. There is almost always a way to find the
few people needed for a valid usability test.

A Small Sample Size Is Still the Norm. The fact that usability testing uncovers usability problems quickly remains one of its most compelling properties. Testers know from experience that in a diagnostic test, the sessions begin to get repetitive after running about five participants in a group. The early research studies by Virzi (1990, 1992; see Fig. 56.1) showing that 80% of the problems are uncovered with about five participants and 90% with about 10 continue to be confirmed (Law & Vanderheiden, 2000).
What these studies mean for practitioners is that, given a sample of tasks and a sample of participants, just about all of the problems testers will find appear with the first 5 to 10 participants. This research does not mean that all of the possible problems with a product appear with 5 or 10 participants, but most of the problems that are going to show up with one sample of tasks and one group of participants will occur early.
There are some studies that do not support the finding that small samples quickly converge on the same problems. Lewis (1994) found that for a very large product, a suite of office productivity tools, 5 to 10 participants was not enough to find nearly all of the problems. The studies by Molich et al. (1998, 2001) also do not favor convergence on a common set of problems. As I discuss later, the issue of how well usability testing uncovers the most severe usability problems is clouded by the unreliability of severity judgments.
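The idealized curve in Fig. 56.1 is usually modeled with the cumulative discovery formula 1 - (1 - p)^n, where p is the probability that any single participant encounters a given problem. The formula and the value of p used below are standard assumptions in this literature rather than values stated in the chapter.

```python
# Expected proportion of problems found with n participants, assuming each
# problem is detected by any single participant with probability p. Both the
# formula and p = 0.31 are illustrative assumptions, not values from the text.
def proportion_found(n, p=0.31):
    return 1 - (1 - p) ** n

for n in (1, 5, 10, 15, 20):
    print(n, round(proportion_found(n), 2))
# With p = 0.31, about 84% of detectable problems are expected by the fifth
# participant and over 95% by the tenth, in line with the curve in Fig. 56.1.
```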
FIGURE 56.1. An idealized curve showing the number of participants needed to find various proportions of usability problems (vertical axis: proportion of problems uncovered, 0.0 to 1.0; horizontal axis: number of participants in the test, 5 to 20).

Recruiting Participants and Getting Them to Show Up. To run the test you plan, you will need to find candidates and qualify them for inclusion in the test. Usually, there are inclusion and exclusion criteria. For example, from a user profile for a test of an instruction sheet that accompanies a ground fault circuit interrupter (GFCI), the kind of plug installed in a bathroom or near a swimming pool, a test team might want to include people who consider themselves "do-it-yourselfers" and who would be willing to attempt the installation of the GFCI but exclude people who actually had installed one before or who were licensed electricians. The way the testers would qualify candidates is to create a screening questionnaire containing the specific questions to use to qualify each candidate. Then they
have to find candidates to recruit. It often takes some social skills
and a lot of persistence to recruit for a usability test. It takes a
full day to recruit about six participants. For the test of the GFCI
instruction sheet, they may have to go to a hardware store and
approach people who are buying electrical equipment to find
the relevant "do-it-yourselfers."
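A recruiting screener of this kind reduces to a handful of inclusion and exclusion rules. The sketch below encodes the GFCI example in Python; the question names and the qualification logic are hypothetical, drawn only from the criteria mentioned above.

```python
# Hypothetical screener for the GFCI instruction-sheet test: include
# do-it-yourselfers willing to attempt the installation, exclude anyone who
# has already installed a GFCI or is a licensed electrician.
def qualifies(answers):
    if answers["licensed_electrician"]:
        return False          # exclusion criterion
    if answers["installed_gfci_before"]:
        return False          # exclusion criterion
    return answers["considers_self_diy"] and answers["willing_to_install"]

candidate = {
    "considers_self_diy": True,
    "willing_to_install": True,
    "installed_gfci_before": False,
    "licensed_electrician": False,
}
print(qualifies(candidate))  # -> True: this candidate fits the user profile
```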
Many organizations use recruiting firms to find test participants. Firms charge about $100 (in year 2000 U.S. dollars) for
each recruited participant. But the testers create the screening
questions and test them to see if the people who qualify fit the
user profile.
To get participants to show up, the testers or the recruiting
firm need to do the following:
Be enthusiastic with them on the phone.
Offer them some incentive. Nothing works better than money,
about $50 to $75 an hour (in year 2000 dollars) for participants
without any unusual qualifications. Some testing organizations
use gift certificates or free products as incentives. For participants with unusual qualifications, such as anesthesiologists
or computer network managers, the recruiter may need to
emphasize what the candidates are contributing to their profession by participating in the test.
As soon as participants are qualified, send, fax, or e-mail a confirmation citing the particulars discussed on the phone and a map with instructions for getting to the site.
Contact participants one or two days before the test as a
reminder.
Give participants a phone number to call if they can't make
the session, need to reschedule, or will be late.
If testers follow all of these steps, they will still have a no-show rate of about 10%. Some organizations over-recruit for a test, qualifying some extra candidates to be backups in case
another participant is a no-show. A useful strategy can be to
recruit two participants for a session and, if both show up, run
a codiscovery session with both participants. See below for a
description of codiscovery.

There Is a Product or System to Evaluate


Usability testing can be performed with most any technology.
The range includes the following:
Products with user interfaces that are all software (e.g., a
database management system), all hardware (a high-quality
pen), and those that are both (a cell phone, a clock radio, a
hospital patient monitor, a circuit board tester, etc.)
Products intended for different types of users (such as consumers, medical personnel, engineers, network managers,
high school students, computer programmers, etc.)
Products that are used together by groups of users, such as
cooperative work software (Scholtz & Bouchette, 1995)
Products in various stages of development (such as user-interface concept drawings; early, low-tech prototypes; more fully functioning, high-fidelity prototypes; products in beta testing; and completed products)
Components that are imbedded in or accompany a product
(such as print manuals, instruction sheets that are packaged
with a product, tutorials, quick-start programs, online help,
etc.)
Testing Methods Work Even With Prototypes. One of the
major advances in human-computer interaction over the last
15 years is the use of prototypes to evaluate user interface designs. The confidence that evaluators have in the validity of
prototypes has made it possible to move evaluation sooner and
sooner in the development process. Evaluating prototypes has
been facilitated by two developments: (a) using paper prototypes and (b) using software specifically developed for prototyping. Ten or more years ago, usability specialists wanted to
create prototyping tools that make it possible to save the code
from the prototype to use it in the final product. Developers
soon realized that a prototyped version of a design is seldom
so close to the final design that it is worth saving the code.
Consequently, the focus of new prototyping tools has been on
speed of creating an interactive design. In addition, the speed
with which these software tools and paper prototypes can be
created makes it possible to evaluate user interface concepts
before the development team gets so enamored with a concept
that they won't discard it.
There have been several studies that have looked at the validity of user testing using prototypes. These studies compare
paper or relatively rough drawings with more interactive and
polished renderings. The studies all show that there are few differences between high- and low-fidelity prototypes in terms of
the number or types of problems identified in a usability test or
in the ratings of usability that participants give to the designs
(Cantani & Biers, 1998; Landay & Myers, 1995; Virzi, Sokolov, &
Karis, 1996; Wiklund, Dumas, & Thurrot, 1992).

The Participants Think Aloud As They Perform Tasks


This is the execution phase of the test. It is where the test participant and the test administrator interact and it is where the
data are collected. Before the test session starts, the administrator gives a set of pretest instructions. The instructions tell the
participant how the test will proceed and that the test probes
the usability of the product, not their skills or experience.
One of the important parts of the pretest activities is the
instructions on thinking aloud. The administrator tells the participants to say out loud what they are experiencing as they
work. Interest in thinking aloud was revived with the recent
publication of two articles that were written independently at
almost the same time (Boren & Ramey, 2000; Dumas, 2001).
Both of these articles question assumptions about the similarity between the think-aloud method as used in usability testing
and the think-aloud method used in cognitive psychology research. Until these articles were published, most discussions of
the think-aloud method used in usability testing automatically
noted its superficial resemblance to the method described by
Ericsson and Simon (1993) to study human problem solving.

In both methods, the participants are taught to think aloud by
providing instructions on how to do it, showing an example of
the think aloud by giving a brief demonstration of it, and by
having the participant practice thinking aloud. But there ends
the similarity to research.
In cognitive psychology research, thinking aloud is used to
study what is in participants' short-term memory. Called Level 1
thinking aloud, the research method focuses on having the
participant say out loud what is in the participant's short-term memory, which can only occur when the participants describe
what they are thinking as they perform cognitive tasks such as
multiplying two numbers. Participants are discouraged from reporting any interpretations of what is happening, any emotions
that accompany the task, and their expectations or violations
of them. The research method is thought, by its proponents, to
provide entree only into the short-term memory of the participants in the research.
In usability testing, the focus is on interactions with the object being tested and with reporting not only thoughts, but also
expectations, feelings, and whatever the participants want to
report. Reports of experiences other than thoughts are important because they often are indicators of usability problems.
This discrepancy between the two sets of think-aloud instructions led Boren and Ramey (2000) to look at how thinking aloud is used in user testing as well as practices related
to thinking aloud, such as how and when to encourage participants to continue to do it. They reported the results of observing test administrators implementing these practices and
how little consistency there is among them. Boren and Ramey
explored other aspects of the verbal communication between
administrators and participants, including how to keep the
participants talking without interfering with the think-aloud
process.
Bowers and Snyder (1990) conducted a research study
to compare the advantages and disadvantages of having test
participants think out loud as they work, called concurrent
thinking aloud, with thinking out loud after the session, called
retrospective thinking aloud. In the retrospective condition, the
participants performed tasks in silence then watched a videotape of the session while they thought aloud. This is an interesting study because of its implications for usability testing. The
group of participants who performed concurrent thinking aloud
were not given typical think-aloud instructions for a usability
test. Instead, their instructions were typical of a think aloud research study. Participants were told to "describe aloud what they
are doing and thinking." They were not told to report any other
internal experiences. In addition, they were never interrupted
during a task. There was no probing. Any encouragement they
needed to keep talking was only done between tasks. The retrospective participants were told that they would be watching
the videotape of the session after the tasks and would be asked
to think aloud then.
There were several interesting results. First, there were no
differences between the concurrent and retrospective groups
in task performance or in task difficulty ratings. The thinking
aloud during the session did not cause the concurrent group to
take more time to complete tasks, to complete fewer tasks, or to
rate tasks as more difficult in comparison with the performance

of the retrospective participants. These findings are consistent with the results from other think-aloud research studies, although in some studies thinking aloud does take longer.
The differences between the groups were in the types
of statements the participants made when they thought out
loud. The concurrent group verbalized about 4 times as many
statements as the retrospective group, but the statements were
almost all descriptions of what the participants were doing or
reading from the screen. The participants who did concurrent
thinking aloud were doing exactly as they were instructed;
they were attending to the tasks and verbalizing a kind of
"play-by-play" of what they were doing. The participants in the
retrospective condition made only about one fourth as many
statements while watching the tape, but many more of the
statements were explanations of what they had been doing
or comments on the user interface design. "The retrospective
subjects... can give their full attention to the verbalizations
and in doing so give richer information" (Bowers & Snyder,
1990, p. 1274).
This study shows us what would happen if we tried to get participants in usability tests to report only Level 1 verbalizations
and did no probing of what they were doing and thinking. Their
verbalizations would be much less informative. The study does
show that retrospective thinking aloud yields more diagnostic
verbalizations, but it takes 80% longer to have the participants do
the tasks silently and then think out loud as they watch the tape.
There have been two fairly recent studies that compared a
condition in which the tester could not hear the think aloud
of participants with a condition in which they could (Lesaigle
& Biers, 2000; Virzi, Sorce, & Herbert, 1993). In the Virzi et al.
study, usability professionals who recorded usability problems
from a videotape of test participants thinking aloud were compared with usability professionals who could see only the performance data of the test participants. Those who had only
the performance data uncovered 46% of the problems with the
product, whereas those seeing the think aloud condition uncovered 69%. In the Lesaigle and Biers study, usability professionals who could see a video of only the screens the participants
could see were compared with comparable professionals who
could see the screens and hear the participants think aloud. The
results showed that there were fewer problems uncovered in
the screen-only condition compared with the screen-plus-think-aloud condition. Both of these studies suggest that in many cases
participants' think-aloud protocols provide evidence of usability
problems that do not otherwise show up in the data.
Conflict in Roles. Dumas (2001) explored how difficult it
can be for test administrators to keep from encouraging or discouraging participants' positive or negative statements. He saw
a conflict in two roles that administrators play: (a) the friendly
facilitator of the test and (b) the neutral observer of the interaction between the participant and the product.
The friendly facilitator role and the neutral observer role
come into conflict when participants make strong statements
expressing an emotion such as, "I hate this program!" Almost
anything the test administrator says at that point can influence
whether the participants will report more or fewer of these
negative feelings. Consider the following statements:



"Tell me more about that": relatively neutral in content but could
be interpreted as encouraging more negative statements
"That's great feedback": again relatively neutral to someone
who has training in test administration but, I believe, sounds
evasive to participants
"Those are the kinds of statements that really help us to understand how to improve the product": reinforcing the negative
"I really appreciate your effort to help us today": says nothing
about the content of what the participant said and is part of
playing the friendly role with participants. Will the participant hear it that way?
Silence: Neutral in content, but how will it be interpreted? In human interaction, one person's silence after another's strong
statement is almost always interpreted as disagreement or
disapproval. Without any other instructions, the participant
is left to interpret the test administrator's silence: you don't care, you don't want negative comments, strong feelings are inappropriate in this kind of test, and so on.
Not all of the biasing responses to emotional statements are
verbal. If the tester is in the room with the participant and
takes notes when participants make an emotional statement,
he or she may be reinforcing them to make more. Any of these
responses could push test participants to utter more or fewer
strong feelings. Dumas suggested that one way to avoid this
conflict in roles is to tell participants what the two roles are in
the pretest instructions.
The Special Case of Speech-Based Products. For the most
part, the basic techniques of user testing apply to speech applications. There are a few areas, however, where testers may
need to modify their methods (Dobroth, 1999):
It is not possible for test participants to think aloud while
they are using a speech recognition application because talking
interferes with using the application. If participants were to
speak aloud, the verbalized thoughts may be mistaken for input
by the speech recognizer. Moreover, if participants think aloud
while using a speech application, they may not be able to hear
spoken prompts. One way to get around this problem is to have
participants comment on the task immediately after finishing
it. This works well for tasks that are short and uncomplicated.
If tasks are longer, however, participants will begin to forget
exactly what happened in the early parts of the task. In this case,
the test administrator can make a recording of the participants'
interaction with the system as they complete the task. At the
end of the task, participants listen to the recording and stop it to
comment on parts of the interaction that they found either clear
or confusing. Both of these solutions provide useful information
but add a substantial amount of time to test sessions.
Evaluating speech-based interfaces often is complicated by
the presence of a recognizer in the product. The recognizer interprets what the test participant says. Often the recognizer
can't be changed; it is the software that surrounds it that is being tested. Using a poor recognizer often clouds the evaluation
of the rest of the software. In a "Wizard-of-Oz" test (see chapter
52 for more on this technique), the test administrator creates
the impression in participants that they are interacting with a voice response system. In reality, the flow and logic of each interaction is controlled by the test administrator, who interprets
participants' responses to prompts, and responds with the next
prompt in the interaction. Using this method also allows the
administrator to be sure that error paths are tested. In a speech
interface, much of the design skill is in dealing with the types
of errors that recognizers often make.
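The chapter describes the Wizard-of-Oz setup only in general terms. One way to support it is to script the prompt flow so that the wizard, rather than a recognizer, decides which branch the participant's utterance matches. The sketch below is a hypothetical illustration; the states, prompts, and banking task are invented.

```python
# Each state holds the prompt the wizard reads aloud and the branches the
# wizard can choose from, including an explicit error path.
FLOW = {
    "menu":  ("Say 'balance' or 'transfer'.",
              {"balance": "balance", "transfer": "transfer", "error": "retry"}),
    "retry": ("Sorry, I didn't catch that. Say 'balance' or 'transfer'.",
              {"balance": "balance", "transfer": "transfer", "error": "retry"}),
    "balance":  ("Your balance is one hundred dollars. Goodbye.", {}),
    "transfer": ("How much would you like to transfer?", {}),
}

def run_wizard(state="menu"):
    while True:
        prompt, branches = FLOW[state]
        print("PROMPT:", prompt)
        if not branches:
            break
        # The wizard types the branch that best matches what the participant
        # said; choosing 'error' deliberately exercises the error handling.
        choice = input(f"Wizard, pick one of {sorted(branches)}: ").strip()
        state = branches.get(choice, "retry")

if __name__ == "__main__":
    run_wizard()
```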
In the past, it was much more difficult to create a prototype of a speech-based product, but several options are now
available. Another option is to use the speech capabilities of
office tools such as Microsoft's PowerPoint.
Selecting Tasks. One of the essential requirements of every
usability test is that the test participants attempt tasks that users
of the product will want to do. When a product of even modest
complexity is tested, however, there are more tasks than there is
time available to test them. Hence the need to select a sample of
tasks. Although not often recognized as a liability of testing, the
sample of tasks is a limitation to the scope of a test. Components
of a design that are not touched by the tasks the participants
perform are not evaluated. This limitation in thoroughness is often why testing is combined with usability inspection methods,
which have thoroughness as one of their strengths.
In a diagnostic test, testers select tasks for several reasons:
They include important tasks, that is, tasks that are performed frequently or are basic to the job users will want to
accomplish, and tasks, such as log in or installation, that are
critical, if infrequent, because they affect other tasks. With almost any product there is a set of basic tasks. Basic means tasks
that tap into the core functionality of the product. For example, a nurse using a patient monitor will frequently look to see
the vital sign values of the patient and will want to silence any
alarms once she or he determines the cause. In addition, the
nurse will want to adjust the alarm limits, even though the limit
adjustment may be done infrequently. Consequently, viewing
vital signs, silencing alarms, and adjusting alarm limits are basic
tasks.
They include tasks that probe areas where usability problems are likely. For example, if testers think that users will have
difficulty knowing when to save their work, they may add saving
work to several other tasks. Selecting these kinds of tasks makes
it more likely that usability problems will be uncovered by the
test, an important goal of a diagnostic test. But including these
kinds of tasks makes it likely that a diagnostic test will uncover
additional usability problems. In effect, these tasks pose a more
difficult challenge to a product than if just commonly done or
critical tasks are included. These tasks can make a product look
less usable than if they were not included. As we will see below,
this is one of the reasons why a diagnostic test does not provide
an accurate measure of a product's usability.
They include tasks that probe the components of a design.
For example, tasks that force the user to navigate to the lowest
level of the menus or tasks that have toolbar shortcuts. The goal
is to include tasks that increase thoroughness at uncovering
problems. When testing other components of a product, such
as a print manual, testers may include tasks that focus on what is in the manual, such as a task that just asks the participant to locate a number of items (Branaghan, 1998).
Some additional reasons for selecting tasks are:
They may be easy to do because they have been redesigned
in response to the results of a previous test.
They may be new to the product line, such as sending an order
for a drug to the hospital pharmacy.
They may cause interference from old habits, such as a task
that has been changed from a previous release of the product.
With so many reasons for selecting tasks, paring the task
list to the time available is an important part of test planning.
Typically testers and developers get together in the early stages
of test planning to create a task list. In addition to including tasks
in the list, the testers need to make some preliminary estimate
of how long each task will take. The time estimate is important
for deciding how many tasks to include, and it may also be useful
for setting time limits for each task. Even in a diagnostic test,
time limits are useful because testers want participants to get
through most of the tasks. Setting time limits is always a bit
of a guess. Until you conduct a pilot test, it is difficult to make
accurate estimates of time limits, but some estimate is necessary
for planning purposes.
The Tasks Are Presented in Task Scenarios. Almost without
exception, testers present the tasks that the participants do in
the form of a task scenario. For example:
You've just bought a new combination telephone and answering machine. It is in the box on the table. Take the product out of the box and
set it up so that you can make and receive calls.

A good scenario is short, in the user's words not the product's, unambiguous, and gives participants enough information
to do the task. It never tells the participant how to do the task.
From the beginning, usability testers recognized the artificiality of the testing environment. The task scenario is an attempt to
bring a flavor of the way the product will be used into the test.
In most cases, the scenario is the only mechanism for introducing the operational environment into the test situation. Rubin
(1994, p. 125) describes task scenarios as adding context and
the participant's rationale and motivation to perform tasks. "The
context of the scenarios will also help them to evaluate elements
in your product's design that simply do not jibe with reality"
and "The closer that the scenarios represent reality, the more
reliable the test results" (emphasis added). Dumas and Redish
(1999, p. 174) said, "The whole point of usability testing is to
predict what will happen when people use the product on their
own.... The participants should feel as if the scenario matches
what they would have to do and what they would know when
they are doing that task in their actual jobs" (emphasis added).
During test planning, testers work on the wording of each
scenario. The scenario needs to be carefully worded so as not
to mislead the participant to try to perform a different task.
Testers also try to avoid using terms in the scenario that give

the participants clues about how to perform the task, such as


using the name of a menu option in the scenario.
In addition to the wording of the task scenarios, their order
may also be important. It is common for scenarios to have dependencies. For example, in testing a cellular phone there may
be a task to enter a phone number into memory and a later task
to change it. A problem with dependencies happens when the
participant can't complete the first task. Testers have developed
strategies to handle this situation, such as putting a phone number in another memory location that the test administrator can
direct the participants to when they could not complete the
earlier task.
Testers continue to believe in the importance of scenarios
and always use them. There is no research, however, showing
that describing tasks as scenarios rather than simple task statements makes any difference to the performance or subjective
judgments of participants. But taking note of the product's use
environment may be important, as described in the next section.

The Participants Are Observed, and Data Are Recorded and Analyzed
Capturing Data As They Occur. Recording data during the
session remains a challenge. All agree that testers need to plan
how they will record what occurs. There are too many events
happening too quickly to be able to record them in free-form
notes. The goal is to record key events while they happen rather
than having to take valuable time to watch videotapes later.
There are three ways that testers deal with the complexity of
recording data:
Create data collection forms for events that can be anticipated (Kantner, 2001a).
Create or purchase data logging software (Philips & Dumas, 1990).
Automatically capture participant actions in log files (Kantner, 2001b) or with specialized software (Lister, 2001).
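As a small illustration of the second option, data logging can be as simple as time-stamping coded events as the observer records them. The event codes, fields, and file format below are hypothetical and are not taken from any of the tools cited above.

```python
import csv
import time

class SessionLogger:
    """Minimal sketch of a logger the observer drives during a session."""

    def __init__(self, participant_id, path):
        self.participant_id = participant_id
        self.start = time.monotonic()
        self.file = open(path, "w", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["participant", "seconds", "task", "event", "note"])

    def log(self, task, event, note=""):
        elapsed = round(time.monotonic() - self.start, 1)
        self.writer.writerow([self.participant_id, elapsed, task, event, note])

    def close(self):
        self.file.close()

# During the session the observer presses buttons or keys that call log(),
# so events are captured as they happen rather than from the videotape.
logger = SessionLogger("P03", "p03_log.csv")
logger.log(1, "task_start")
logger.log(1, "error", "looked in Help for Save As")
logger.log(1, "reject", "believed the file was on the diskette")
logger.close()
```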
Figure 56.2 shows a sample data collection form for the task
of saving a file to a diskette in Microsoft Windows. Notice that
it is set up to capture both paths to success and paths to failure.
The form also allows for capturing a Reject, which is a task that
a participant considers complete but the data collector knows
is not. Rejects are important to note because, although they
are failures, they often have task times that are faster than even
successful tasks.
The use of data logging software continues at many of the
larger testing facilities. Most have created their own logging
software. Logging test activities in real time, however, continues
to be a messy process. Testers almost always have to edit the data
log after each session to remove errors and misunderstandings,
such as when a task was really over.
It is difficult to use forms or software when the administrator is sitting in the test room beside the participant and is
conducting the test alone. Without a data recorder, it is difficult,
but still possible, to sit in the test room with the test participant
and to record data at the same time.


Task 1. Copy a Word file to a diskette

Pass (Time ____)
Explorer: dragged file from one Explorer pane to another with left / right button; File: Send to: Floppy (A); copied and pasted in Explorer with Toolbar / Edit menu / keyboard
My Documents: dragged to Desktop then back with left / right button; File: Send to: Floppy (A); copied and pasted with Toolbar / Edit menu / keyboard
Word: opened Word and chose File: Save As

Fail or Reject (Time ____)
Word: chose Help (Save / Windows / Word); topic: ____

FIGURE 56.2. Sample data collection form.


Collecting data is a special challenge with Web-based products. There are so many links and controls on a typical Web page
that it is difficult to record what is happening short of watching the session again on videotape. This difficulty has renewed
interest in automatic data collection. But the tools to do this
capture usually record data that are at too low a level to uncover usability problems. Most usability problems don't need to
be diagnosed at the mouse click or key press level.
Getting Developers and Managers to Watch Test Sessions.
One of the important assets of testing is that it sells itself. Watching even a few minutes of live testing can be very persuasive.
There are two reasons why testers need to get key project staff
and decision makers to come to watch a test session:
When people see their first live test session they are almost
always fascinated by what they see. They gain understanding of
the value of the method. Watching a videotape of a session does
not provide the same experience. Expend whatever effort it
takes to get these people to attend test sessions.
When developers see live sessions, it is much easier to
communicate the results to them. When they have seen some
of the usability problems themselves, they are much less likely to resist agreeing on what the most important problems are. Some of them will even become advocates for testing.
Even though testing is known and accepted by a much wider
circle of people than it was 10 years ago, the experience of
watching a user work at a task while thinking aloud still converts more people to accept usability practices than any other
development tool.
More Than One Participant. Most usability tests are run with a single test participant. Studies show
that when two participants work together, sometimes called
the codiscovery method (Kennedy, 1989), they make more utterances. The nature of the utterances also is different, with
codiscovery participants making more evaluative, as opposed to descriptive, statements and making more statements that developers view as useful (Hackman & Biers, 1992). But using
codiscovery does require recruiting twice as many participants.
A related method is to have one participant teach another how
to do a task (Vora, 1994).
The Usability Lab Is Now Ubiquitous. Usability labs continue to be built, and there is a brisk business in selling lab equipment. The demand for labs is driven by the advantages of having recording equipment and the ability to allow stakeholders to view the test sessions. In essence, the method sells itself
in the sense that developers and managers find compelling the
experience of watching a live test session. A testing facility, especially one with a one-way mirror, adds a sense of scientific
credibility to testing, which, as we will discuss below, may be a
false sense.
The basic makeup of a suite of usability test equipment has
not changed much with time. It consists of video and audio
recording equipment and video mixing equipment. For testing
products that run on general-purpose computer equipment, a
common setup is a scan converter showing what is on the test
participant's screen and a video camera focused on the face or
head and shoulders of the participant.
There are some recent innovations in lab equipment that are
enhancing measurement. Miniaturization continues to shrink
the size of almost all lab equipment, hence the arrival of portable
lab setups that fit in airplane overhead compartments. Relatively inexpensive eye-tracking equipment has made it possible
to know where participants are looking as they work.
The quality of video images recorded during sessions has
always been poor. Second-generation copies, which are often
used in highlight tapes, make the quality of highlight tapes even
poorer. Scan converters selling for under $2,000 produce surprisingly poor images, making it difficult to see screen details.
The move to digital video and inexpensive writeable CDs
promises to improve recordings and to make it substantially
easier to find and edit video segments.

Mimicking the Operational Environment. Testers often make changes to the setup of the test room. Rubin (1994,
p. 95) describes the requirements for the testing environment as
follows: "Make the testing environment as realistic as possible.
As much as possible, try to maintain a testing environment that
mimics the actual working environment in which the product
will be used." But is putting a couch in the test room to make
it look more like a room in a home simulating the use environment? It may not be, but going to the participant's home for testing is a complex process (Mitropoulos-Rundus & Muzak, 1997).
The literature on product evaluation, when viewed from the
perspective of 50 years, shows that in complex operational environments, researchers and practitioners have used software
simulations or hardware-software simulators to mimic that operational environment. For example, aircraft and automobile
simulators are used to study interactions with cockpits and dashboards as well as for operator training. More recently, hospital
operating-room simulators have been developed to study equipment interaction issues in anesthesiology (Gaba, 1994).
A variable, usually called fidelity, is used to describe the degree to which simulations or simulators mimic the operational
environment. In those interactions between users and aircraft,
automobiles, and operating rooms, the environment is so important that simulations are needed to mimic it. There may be
other environments that influence the usability of the products
we test, and we need to think more about the fidelity of our testing environments (Wichansky, 2000). I discuss this issue further
in the section Challenges to the Validity of Testing.

The Impact of the Testing Equipment. An issue that has been debated throughout the history of usability testing is the
impact of one-way mirrors and recording equipment on the test
participants. This debate comes to a head in discussions about
whether the test administrator should sit with participants as
they work or stay behind the one-way mirror and talk over an
intercom. Some testing groups always sit with the participant,
believing that it reduces the participants' anxiety about being
in the test and makes it easier to manage the session (Rubin,
1994). Other testing groups normally do not sit with the participants, believing that it makes it easier to remain objective and
frees the administrator to record the actions of the participants
(Dumas & Redish, 1999). There is one study that partially addressed this issue. Barker and Biers (1994) conducted an experiment in which they varied whether there was a one-way mirror
and cameras in the test room. They found that the presence of
the equipment did not affect the participants' performance or
ratings of usability of the product.
Remote Testing. Remote usability testing refers to situations in which the test administrator and the test participant
are not at the same location (Hartson, Castillo, Kelso, Kamler,
& Neale, 1996). This can happen for a number of reasons, such
as testing products used by only a few users who are spread
throughout the country or the world. Products such as NetMeeting software make it possible for the tester to see what is
on the participant's screen and, with a phone connection, hear
the participants think aloud.
There are other technologies that can provide testers with
even more information, but they often require both parties to
have special video cards and software. There are also technologies for having ratings or preference questions pop up while
participants are working remotely (Abelow, 1992). But Lesaigle
and Biers' (2000) study showed that uncovering problems only
through participants' questionnaire data had the least overlap
with conditions in which testers uncovered problems by watching the participants work or seeing the screens on which participants worked. They concluded that "the present authors are skeptical
about using feedback provided by the user through online questionnaires as the sole source of information" (p. 587). Still, no
one would disagree that some remote testing is better than no
testing at all.
The primary advantages of remote testing include the following:
Participants are tested in an environment in which they are
comfortable and familiar.
Participants are tested using their own equipment environment.
Test costs are reduced because participants are easier to recruit, do not have to travel, and often do not have to be compensated. In addition, there are no test facility costs.
But there can be disadvantages to remote testing:
With live testing, the viewer or conferencing software can
slow down the product being tested.
Company firewalls can prevent live testing. Viewer and meeting software often cannot be used if there is a firewall.



Perkins (2001) described a range of remote usability testing options that includes user-reported critical incidents, embedded
survey questions, live remote testing with a viewer, and live
remote testing with conferencing software.

Measures and Data Analysis


In this section, I discuss test measures, discrepancies between
measures, and data analysis.
Test Measures. There are several ways to categorize the
measures taken in a usability test. One is to break them into
two groups: (a) performance measures, such as task time and
task completion, and (b) subjective measures, such as ratings
of usability or participants' comments. Another common breakdown uses three categories: (a) efficiency measures (primarily task time), (b) effectiveness measures (such as task success), and (c) satisfaction measures (such as rating scales and
preferences).
Most performance measures involve time or simple counts
of events. The most common time measure is time to complete
each task. Other time measures include the time to reach intermediate goals, such as the time to find an
item in Help. The counts of events in addition to task completion include the number of various types of errors, especially
repeated errors, and the number of assists. An assist happens when the test administrator decides that the participant is not making progress toward task completion and gives just enough help to move the task forward, so that the administrator can continue to learn more about the product by keeping the participant working on the task. An assist is important because
it indicates that there is a usability problem that will keep participants from completing a task. The way assists are given to
participants by test administrators, part of the art of running a
test, is not consistent from one usability testing organization to
another (Boren & Ramey, 2000).
There are some complex measures that are not often used
in diagnostic tests but are sometimes used in comparison tests.
These measures include the time the participant works toward
the task goal divided by the total task time (sometimes called
task efficiency) and the task time for a participant divided by the
average time for some referent person or group, such as an expert or an average user. It seems only natural that an important
measure of the usability of a product should be the test participants' opinions and judgments about the ease or difficulty of
using it. The end of the test session is a good time to ask for
those opinions. The participant has spent an hour or two using
the product and probably has as much experience with it as he
or she is likely to have. Consequently, a posttest interview or a
brief questionnaire is a common subjective measure (see, however, the discussion about the discrepancies between measures
in the next subsection).
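To make the performance measures described above concrete, the short sketch below computes a completion rate, a mean task time, the task-efficiency ratio (time working toward the goal divided by total time), and the ratio to an expert's time. All of the numbers are invented for illustration:

    # One row per participant for a single task:
    # (total task time in seconds, seconds spent working toward the goal,
    #  task completed?, number of assists)
    results = [
        (420, 300, True, 0),
        (610, 380, False, 1),
        (350, 330, True, 0),
        (500, 310, True, 1),
    ]
    expert_time = 180  # assumed reference time for the same task

    completion_rate = sum(r[2] for r in results) / len(results)
    mean_time = sum(r[0] for r in results) / len(results)
    mean_efficiency = sum(r[1] / r[0] for r in results) / len(results)  # productive / total
    expert_ratio = mean_time / expert_time  # greater than 1 means slower than the expert
    assists = sum(r[3] for r in results)

    print(f"Completion rate: {completion_rate:.0%}")
    print(f"Mean task time: {mean_time:.0f} s ({expert_ratio:.1f} times the expert time)")
    print(f"Mean task efficiency: {mean_efficiency:.2f}")
    print(f"Assists given: {assists}")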
Eye tracking is a relatively new measure in user testing. Eye-tracking equipment has come down in price in recent years. You can purchase a system for about $40,000, plus or minus $10,000, depending on accessories and data reduction software.
The new systems are head mounted and allow the test participant a good deal of movement without losing track of where
the eye is looking.


Interest in where participants are looking has increased with the proliferation of Web software. There are so many links and
controls on a Web page that it can be difficult to know exactly
where participants are looking. An eye tracker helps solve that
problem. Not all test participants can be calibrated on an eye
tracker; up to 20 percent of typical user populations cannot be
calibrated because of eye abnormalities.
The data from the tracker are broken into fixations: 300-millisecond periods during which the point of regard doesn't move more than 1 degree of visual angle. Each fixation has a start time,
duration, point of gaze coordinates, and average pupil diameter.
Eye movement analysis involves looking at fixations within an
area of interest (AOI), which is a tester-defined area on a screen.
These regions usually define some object or control on a page.
There are AOIs on each page and testers use eye-tracking software to compute statistics such as the average amount of time
and the number and duration of fixations in each AOI. Then
there are statistics that measure eye movements from one AOI
to another and plots of scan paths.
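A minimal sketch of the AOI statistics described above: given fixation records (start time, duration, and point of gaze) and a set of rectangular AOIs, it counts the fixations and sums the dwell time in each region. The coordinates and AOI names are made up:

    # Each fixation: (start ms, duration ms, x, y); each AOI: name -> (x1, y1, x2, y2)
    fixations = [(0, 310, 40, 60), (310, 450, 420, 75), (760, 300, 430, 80), (1060, 500, 50, 400)]
    aois = {"navigation bar": (0, 0, 100, 120), "search box": (400, 50, 500, 100)}

    def aoi_stats(fixations, aois):
        """Count fixations and total dwell time (ms) inside each area of interest."""
        stats = {name: {"fixations": 0, "dwell_ms": 0} for name in aois}
        for start, duration, x, y in fixations:
            for name, (x1, y1, x2, y2) in aois.items():
                if x1 <= x <= x2 and y1 <= y <= y2:
                    stats[name]["fixations"] += 1
                    stats[name]["dwell_ms"] += duration
        return stats

    for name, s in aoi_stats(fixations, aois).items():
        print(f"{name}: {s['fixations']} fixations, {s['dwell_ms']} ms of dwell time")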
Eye tracking systems produce a great deal of data; a 60-Hz
tracker produces 3,600 records a minute, so data reduction becomes a major task. Consequently, eye tracking
isn't something that is used without a specific need. Goldberg
(2000) identified evaluation criteria that can benefit most from
eye tracking data, with visual clarity getting the most benefit.
But there are other evaluation areas that are likely to benefit as
tracking technology becomes cheaper and easier to manage.
Discrepancies Between Measures. Some investigators find
only a weak correlation between efficiency measures and effectiveness measures. Frokjaer, Hertzum, and Hornbaek (2000)
described a study in which they found such a weak correlation.
They then went back and looked at several years of usability test
reports in the proceedings of the annual CHI conference. They
noted that it is common for testers to report only one category
of performance measure and cautioned not to expect different
types of measures to be related.
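A tester who wants to check this relationship in a test's own data can compute the correlation directly. The sketch below is a plain Pearson correlation over invented per-participant numbers; it is not the analysis Frokjaer et al. reported:

    def pearson_r(xs, ys):
        """Pearson correlation coefficient between two equal-length lists."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
        sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
        return cov / (sd_x * sd_y)

    # Hypothetical data: mean task time (seconds) and tasks completed (of 10) per participant
    times = [420, 610, 350, 500, 450, 700]
    completed = [8, 7, 9, 8, 6, 7]
    print(round(pearson_r(times, completed), 2))

With these invented numbers the two measures happen to be moderately related; the point of Frokjaer et al. is that no such relationship can be assumed.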
There is a vocal minority of people writing about usability
testing measures who argue against the use of quantitative performance measures in favor of a qualitative analysis of test data.
Hughes (1999) argued that qualitative measures can be just as
reliable and valid as quantitative measures.
A common finding in the literature is that performance
measures and subjective measures are often weakly correlated.
Lesaigle and Biers (2000) compared how well testers uncovered
usability problems under a number of conditions:
They can see only the screen the participant sees.
They can see the screen and hear the participant think aloud.
They can see the screen, hear the think-aloud, and see the participant's face.
They see only the responses to questionnaire items.
The results show that uncovering problems only through
participants' questionnaire data had the least overlap with the
other three conditions. The authors concluded that "questionnaire data taps a somewhat different problem set" and that "the questionnaire data was less likely to reveal the most severe problems" (p. 587). Bailey (1993) and Ground and Ensing (1999) both reported cases in which participants perform better with products
that they don't prefer and vice versa. Bailey recommended using
only performance measures and not using subjective measures
when there is a choice.
One of the difficulties with test questions is that they are
influenced by factors outside of the experience that participants have during the test session. There are at least three
sources of distortions or errors in survey or interview data:
(a) the characteristics of the participants, (b) the characteristics of the interviewer or the way the interviewer interacts
with the participant, and (c) the characteristics of the task situation itself. Task-based distortions include such factors as the
format of questions and answers, how participants interpret the
questions, and how sensitive or threatening the questions are
(Bradburn, 1983). In general, the characteristics of the task situation produce larger distortions than the characteristics of the
interviewer or the participant. Orne (1969) called these task
characteristics the "demand characteristics of the situation."
(See Dumas, 1998b, 1998c, for a discussion of these issues in
a usability testing context.) In addition to the demand characteristics, subjective measures can be distorted by events in the
test, such as one key event, especially one that occurs late in
the session.
Creating closed-ended questions or rating scales that probe
what the tester is interested in is one of the most difficult challenges in usability test methodology. Test administrators seldom
have any training in question development or interpretation.
Unfortunately, measuring subjective states is not a knowledge
area where testers' intuition is enough. It is difficult to create
valid questions, that is, questions that measure what we want
to measure. Testers without training in question development
can use open-ended questions and consider questions as an opportunity to stimulate participants to talk about their opinions
and preferences.
Testers often talk about the common finding that the way
participants perform using a product is at odds with the way
the testers themselves would rate the usability of the product.
There are several explanations for why participants might say
they liked a product that, in the tester's eyes, was difficult to use.
Most explanations point to a number of factors that all push user
ratings toward the positive end of the scale. Some of the factors
have to do with the demand characteristics of the testing situation, for example, participants' need to be viewed as positive
rather than negative people or their desire to please the test administrator. Other factors include the tendency of participants
to blame themselves rather than the product and the influence
of one positive experience during the test, especially when it
occurs late in the session.
Test participants continue to blame themselves for problems
that usability specialists would blame on the user interface.
This tendency seems to be a deep-seated cultural phenomenon
that doesn't go away just because a test administrator tells the
participant during the pretest instructions that the session is not
a test of the participants' knowledge or ability. These positive
ratings and comments from participants often put testers in a
situation in which they feel they have to explain away participants' positive judgments about the product. Testers always feel that the performance measures are true indicators of usability, whereas subjective statements are unreliable. For example, a
very long task time or a failure to complete a task is a true measure of usability, whereas a positive rating of six out of seven on
usability is inflated by demand characteristics.
Data Analysis. Triangulation of measures is critical. It is
rare that a usability problem affects only one measure. For example, a poorly constructed icon toolbar will generate errors
(especially picking the wrong icon on the toolbar), slow task
times (during which participants hesitate over each icon and
frequently click through them looking for the one they want)
and statements of frustration (participants express their feelings
about not being able to learn how the icons are organized or be
able to guess what an icon will do from the tool tip).
Much of the data analysis involves building a case for a usability problem by combining several measures, a process that
is called triangulation (Dumas & Redish, 1999). The case building is driven by the problem list created during the test sessions.
It is surprising how much of this analysis is dependent on the
think-aloud protocol. We depend on what participants say to
help us understand what the problem is.
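One informal way to organize the evidence that triangulation combines is a problem record that collects the relevant observations from each measure. The structure, field names, and numbers below are purely illustrative, not a published format:

    # A hypothetical problem record built from several measures and several participants
    problem = {
        "description": "Toolbar icons are not recognizable",
        "tasks": [2, 5],
        "participants_affected": ["P1", "P4"],
        "evidence": {
            "errors": ["P1 clicked the wrong icon twice", "P4 clicked the wrong icon"],
            "task_times": {"P1": 410, "P4": 505},   # seconds, versus a 180-second expert time
            "think_aloud": ["\"I can't tell what these pictures mean\""],
            "ratings": {"P4": 3},                   # 1-7 ease-of-use rating
        },
    }

    # The more independent kinds of evidence that point to the same problem,
    # the stronger the case for it.
    kinds = sum(1 for v in problem["evidence"].values() if v)
    print(f"{problem['description']}: {kinds} kinds of evidence, "
          f"{len(problem['participants_affected'])} participants affected")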
Identifying usability problems is key. Most usability problems
do not emerge from the analysis of the data after the test. The
problems are observed during the sessions and are recorded
on problem sheets or data logs. Later, the problem sheet or log
drives the data analysis. The problem sheet is usually created
by the test administrator during the test sessions or immediately afterward. The sheet is organized by participant and by
task. What gets recorded on the sheet are observations, such as
"didn't see the option," and interpretations, such as "doesn't understand the graphic." When the same problem appears again,
it is noted.
Experienced usability testers see the basic causes of problems. From individual instances of problems, the experienced
tester sees patterns that point to more general problems. For
example, a tester might see instances of participants spending
time looking around the screen and aimlessly looking through
menu options and conclude that "the participants were overwhelmed with the amount of information on the screen." From
a number of instances of participants not understanding terms,
the tester might conclude "the interface has too much technical
and computer jargon." From a number of instances of participants doing a task twice to make sure it was completed, the
tester might conclude that "there is not enough feedback about
what the system is doing with the participant's actions." Seeing
the underlying causes of individual problem tokens is one of the
important skills that a usability tester develops. It is not entirely
clear that such skills can be taught quickly. Testers often have
years of experience studying and practicing problem identification skills. But do experienced testers see the same problems
and causes? As I discuss later, there is some doubt about the
consistency of problem labeling.
While watching a test session, a product developer will see
the same events or tokens as the test administrator. But developers tend to see all problems as local. Instead of seeing that
there needs to be a general review of the language in the interface, the developer sees problems with individual words. This conflict often doesn't appear until the testers and developers
sit down to discuss what they saw and what to do about it. Usability professionals believe that this conflict over what "really"
happened during the test remains a major barrier to improving
a product's usability. Handling this conflict takes some diplomacy. Developers don't like to be told that they have tunnel
vision and can't see the underlying causes of individual tokens,
and usability professionals don't like hearing that the local fix will
solve the problem. This conflict continues to limit the impact
of testing on product improvement.
There have been several research studies that have looked at
how many usability problems are uncovered by different populations. These studies consistently show that usability specialists
find more problems than product developers or computer scientists. But all of the studies have used inspection evaluation
methods, not user-based evaluation methods.
One of the issues still being debated about usability problems
is whether to place them into a company's software bug tracking
system (Wilson & Coyne, 2001). Putting them into the system
can be effective if the bugs are more likely to be fixed. But fitting
the bugs into a bug severity rating scale often is difficult, and
there is always a risk that the fix will solve only the local impact
of the problem not its basic structural cause. Some bug tracking
systems require that a bug be assigned only one cause, which
would not adequately describe many usability problems.
One way to call attention to important problems is to put
them into a measurement tool such as a problem severity scale.
These scales determine which problems are the most severe
and, presumably, more likely to be candidates to be fixed. There
have been several recent research studies that have looked at
the validity and reliability of these scales.
A disappointing aspect of this research is the lack of consistency in severity judgments. This lack appears in all forms
of usability evaluation, inspection and user based, and is one of
the most important challenges to usability methodology. Several
practitioners have proposed severity rating schemes: Nielsen
(1992), Dumas and Redish (1999), Rubin (1994), and Wilson and
Coyne (2001). The schemes have three properties in common:
1. They all use a rating scale that is derived from software
bug reporting scales. The most severe category usually involves
loss of data or task failure and the least severe category involves
problems that are so unimportant that they don't need an immediate fix. All of the authors assume that the measurement level
of their scale is at least ordinal, that is, the problems get worse
as the scale value increases. The middle levels between the extremes are usually difficult to interpret and are stated in words
that are hard to apply to specific cases. For example, Dumas and
Redish proposed two middle levels: (a) problems that create significant delay and frustration and (b) problems that have a minor
effect on usability. Nielsen's middle levels are (a) major usability
problem (important to fix and so should be given high priority)
and (b) minor usability problem (fixing is given low priority).
Practitioners are not given any guidance on how problems fit
into the scale levels, especially the middle ones.
2. All of the authors admit, at least indirectly, that their scales
alone are not enough to assess severity. The authors propose one
or more additional factors for the tester to consider in judging severity. For example, Nielsen (1992) described four factors in addition to the severity rating itself: frequency, impact, persistence, and something called "market impact." Rubin (1994) proposed multiplying the rating by the number of users who have
the problem. Dumas and Redish (1999) added a second dimension: the scope of the problem from local to global, with no
levels in between. With the exception of Rubin's multiplication
rule, none of these other factors are described in enough detail to indicate how their combination with the severity scale
would work, which is, perhaps, an indicator of the weakness of
the severity scales themselves. (A small illustration of Rubin's multiplication rule appears after this list.)
3. None of the scales indicate how to treat individual differences. For example, what does one do if only two of eight
participants cannot complete a task because of a usability problem? Is that problem in the most severe category or does it move
down a level? If a problem is global rather than local, does that
change its severity? The authors of these scales provide little
guidance.
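The verbal scales are hard to pin down, but Rubin's multiplication rule can at least be written out. The sketch below applies it to invented problems from a hypothetical eight-participant test; the 1-to-4 ratings and the problems themselves are illustrative, not Rubin's examples:

    def criticality(severity_rating, participants_affected):
        """Rubin-style rule: multiply the severity rating by the number of
        participants who encountered the problem."""
        return severity_rating * participants_affected

    # (description, severity rating on a 1-4 scale with 4 the worst, participants affected)
    problems = [
        ("Cannot find Save As on the File menu", 4, 2),
        ("Misreads the toolbar tool tips", 2, 6),
        ("Overlooks the status-bar message", 1, 3),
    ]

    for description, rating, affected in sorted(
            problems, key=lambda p: criticality(p[1], p[2]), reverse=True):
        print(f"{criticality(rating, affected):>2}  {description}")

Note that with these numbers a frequent but minor problem outranks a rare but severe one, which illustrates why the rule alone is not a complete answer.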
There have been a number of research studies investigating the consistency of severity ratings. These studies all show
that the degree of consistency is not encouraging. Most studies
have looked at the inconsistencies among experts using severity scales with inspection methods such as heuristic evaluation.
But Jacobsen and John (1998) showed that it also applies to
usability testing. They asked four experienced usability testers
to watch tapes of the same usability test and then identify problems, including the top-10 problems in terms of severity. Of
the 93 problems identified with the product, only 20% were
detected by all evaluators, whereas 46% were only found by a
single evaluator. None of the top-10 severe problems appeared
on all four evaluators' lists.
Lesaigle and Biers (2000) reported a disappointing correlation coefficient (0.16) among professional testers' ratings of the
severity of the same usability problems in a usability test. They
used Nielsen's severity rating scale. Cantani and Biers (1998)
found that heuristic evaluation and user testing did not uncover
the same problems, and that severity ratings of usability professionals did not agree with each other.
The results of these studies cast doubt on one of the most
often-mentioned assets of usability testing: its touted ability to
uncover the most severe usability problems.

Communicating Test Results


In the early days of user testing, there almost always was a formal test report. Testers needed reports to communicate what
they did, what they found, and what testing was all about. Now
it is more common for the results of a test to be communicated more informally, such as at a meeting held soon after the
last test session. Communication at these meetings is facilitated
when the product team has attended at least some of the test
sessions.
One of the important reasons for the change in reporting
style for diagnostic usability tests is the confidence organizations
have in the user testing process. It now is less often necessary to
write a report to justify conducting the test. Organizations with active usability programs have come to accept user testing as a valid and useful evaluation tool. They don't feel that they need
to know the details of the test method and the data analysis procedures. They want to know the bottom line: What problems
surfaced, and what should they do about them? In these organizations, a written report may still have value but as a means of
documenting the test.
The Value of Highlight Tapes. A highlight tape is a short,
visual illustration of the 4 or 5 most important results of a test.
In the early days of testing, almost every test had a highlight
tape, especially a tape aimed at important decision makers who
could not attend the sessions. These tapes had two purposes:
to show what happened during the test in an interesting way and
to illustrate what a usability test is and what it can reveal.
As usability testing has become an accepted evaluation tool,
the second purpose for highlight tapes has become less necessary. One of the disappointing aspects of highlight tapes is that
watching them does not have the same impact as seeing the
sessions live. Unless the action moves quickly, even highlight
tapes can be boring. This characteristic makes careful editing
of the highlights a must. But if the editing system is not digital, it takes about 1 hour to create 1 minute of finished tape. A
15-minute tape can take 2 days to create, even by an experienced
editor. Most of that time is taken finding appropriate segments
to illustrate key findings. The emergence of digital video will
make highlight tapes less time-consuming.
Some testers use the capabilities of tools such as PowerPoint
to put selections from a videotape next to a bullet in a slide
presentation rather than having a separate highlight tape. Others
have begun to store and replay video in a different way. There
are video cards for personal computers that will take a feed from
a camera and store images in mpeg format on a compact disk
(CD). Each CD stores about an hour of taping. A tester can then
show an audience the highlights by showing segments of the
CDs in sequence, thus eliminating the need for editing. Because
the cost of blank CDs is only about a dollar, they are cheaper to
buy than videotapes and take up less storage space.

VARIATIONS ON THE ESSENTIALS


In this section, I discuss aspects of usability testing that go beyond the basics of a simple diagnostic test. The section includes
measuring and comparing usability, baseline usability tests, and
allowing free exploration.

Measuring and Comparing Usability


A diagnostic usability test is not intended to measure usability as much as to uncover as many usability problems as it
can. It doesn't directly answer the question, "How usable is
this product?" It would be wonderful to be able to answer that
question with a precise, absolute statement such as, "It's very
usable" or, better, "It's 85% usable." But there is no absolute measure of usability, and without a comparative yardstick it is difficult to pinpoint a product's usability.

It would be ideal if we could say that a product is usable if participants complete 80% of their tasks and if they give it
an average ease-of-use rating of 5.5 out of 7, with 7 being very
usable. But all tasks and tests are not equal. One of the limiting
factors in measuring usability is the makeup of the diagnostic
test itself. It typically tests a very small sample of participants;
it encourages those participants to take the time to think aloud
and to make useful verbal diversions as they work; it allows the
test administrator the freedom to probe interesting issues and
to take such actions as skipping tasks that won't be informative;
and it deals with a product that might be in prototype form
and, consequently, will occasionally malfunction. Those qualities make a diagnostic test good at exploring problems, but
limited at measuring usability.
Historically, human factors professionals have made a distinction between formative and summative measurement. A formative test is done early in development to contribute to a product's design; a summative test is performed late in development
to evaluate the design. A diagnostic test is clearly a formative
test. But what specifically is a summative usability test?
At the present time, without a comparison product, we are
left with the judgment of a usability specialist about how usable a product is based on their interpretation of a summative
usability test. Experienced usability professionals believe that
they can make a relatively accurate and reliable assessment of
a product's usability given data from a test designed to measure usability, that is, a test with a stable product and a larger
sample than is typical and one in which participants are discouraged from making verbal diversions, and the administrator
makes minimal interruptions to the flow of tasks. This expert
judgment is the basis of the common industry format (CIF) I describe below. Perhaps someday, we will be able to make more precise measurements based directly on the measures that are not filtered through the judgment of a usability professional. In
the meantime, those judgments are the best estimate we have.

Comparing the Usability of Products


An important variation on the purpose of a usability test is one
that focuses primarily on comparing usability. Here the intention
is to measure how usable a product is relative to some other
product or to an earlier version of itself. There are two types
of comparison tests: (a) an internal usability test focused on
finding as much as possible about a product's usability relative
to a comparison product (a comparative test or a diagnostic,
comparative test) and (b) a test intended to produce results that
will be used to measure comparative usability or to promote the
winner over the others (a competitive usability test).
In both types of comparison tests, there are two important
considerations: (a) The test design must provide a valid comparison between the products and (b) the selection of test participants, the tasks, and the way the test administrator interacts
with participants must not favor any of the products.
Designing Comparison Tests. As soon as the purpose of the
test moves from diagnosis to comparison measurement, the test
design moves toward becoming more like a research design. To demonstrate that one product is better on some measure, you
need a design that will validly measure the comparison. The
design issues usually focus on two questions:
Will each participant use all of the products, some of the products, or only one product?
How many participants are enough to detect a statistically
significant difference?
In the research methods literature, a design in which participants use all of the products is called a "within-subjects" design,
whereas in a "between-subjects" design each participant uses
only one product. If testers use a between-subjects design, they
avoid having any contamination from product to product, but
they need to make sure that the groups who use each product are equivalent in important ways. For example, in a typical
between-subjects design, members of one group are recruited because they have experience with Product A, whereas a second
group is recruited because they have experience with Product
B. Each group then uses the product they know. But the two
groups need to have equivalent levels of experience with the
product they use. They also need to have equivalent skills and
knowledge with related variables, such as job titles and time
worked, general computer literacy, and so on.
Because it is difficult to match groups on all of the relevant
variables, between-subjects designs need to have enough participants in each group to wash out any minor differences. An
important concern to beware of in the between-subjects design
is the situation in which one of the participants in a group is
especially good or bad at performing tasks. Gray and Salzman
(1998) called this the "wildcard effect." If the group sizes are
small, one superstar or dud could dramatically affect the comparison. With larger numbers of participants in a group, the
wildcard has a smaller impact on the overall results. This phenomenon is one of the reasons that competitive tests have larger
sample sizes than diagnostic tests. The exact number of participants depends on the design and the variability in the data.
Sample sizes in competitive tests are closer to 20 in a group
than the 5 to 8 that is common in diagnostic tests.
If testers use a within-subjects design in which each participant uses all of the products, they eliminate the effect of
groups not being equivalent but then have to worry about
other problems, the most important of which are order and
sequence effects and the length of the test session. Because
within-subjects statistical comparisons are not influenced by inequalities between groups, they are statistically more powerful
than between-subjects designs, which means testers need fewer
participants to detect a difference. To eliminate effects due to
order and the interaction of the products with each other, you
need to counterbalance the order and sequence of the products. (See Fisher & Yates, 1963, and Dumas, 1998a, for rules for
counterbalancing.) They also have to be concerned about the
test session becoming so long that participants get tired.
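The usual device for counterbalancing is a balanced Latin square, in which each product appears in each serial position and follows each other product equally often. The sketch below uses a textbook construction for an even number of products; it is not the specific procedure given by Fisher and Yates (1963):

    def balanced_latin_square(products):
        """One presentation order per row; for an even number of products, each
        product appears once in each position and follows each other product once."""
        n = len(products)
        rows = []
        for participant in range(n):
            order, j, k = [], 0, 0
            for position in range(n):
                if position % 2 == 0:
                    order.append(products[(participant + j) % n])
                    j += 1
                else:
                    k += 1
                    order.append(products[(participant + n - k) % n])
            rows.append(order)
        return rows

    for row in balanced_latin_square(["Product A", "Product B", "Product C", "Product D"]):
        print(" -> ".join(row))

With more participants than products, the square is simply repeated, assigning each new participant the next row.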
There are some designs that are hybrids because they
use within-subjects comparisons but don't include all of the
combinations. For example, if testers are comparing their product to two of their competitors, they might not care about how
the two competitors compare with each other. In that case, each participant would use the testers' product and one of the
others, but no one would use both of the competitors' products.
This design allows the statistical power of a within-subjects
design for some comparisons, those involving your product.
In addition, the test sessions are shorter than with the complete
within-subjects design.
Eliminating Bias in Comparisons. For a comparison test
to be valid, it must be fair to all of the products. There are at least
three potential sources of bias: the selection of participants, the
selection and wording of tasks, and the interactions between
the test administrator and the participants during the sessions.
The selection of participants can be biased in both a between- and a within-subjects design. In a between-subjects design, the
bias can come directly from selecting participants who have
more knowledge or experience with one product. The bias can
be indirect if the participants selected to use one product are
more skilled at some auxiliary tasks, such as the operating system, or are more computer literate. In a competitive test using a between-subjects design, it is almost always necessary to
provide evidence showing that the groups are equivalent, such
as by having them attain similar average scores in a qualification test or by assigning them to the products by some random
process. In a within-subjects design, the bias can come from
participants having more knowledge or skill with one
product. Again, a qualification test could provide evidence that
they know each product equally well.
Establishing the fairness of the tasks is usually one of the
most difficult activities in a comparison test, even more so in
a competitive test. One product can be made to look better
than any other product by carefully selecting tasks. Every user
interface has strengths and weaknesses. The tasks need to be
selected because they are typical for the sample of users and
the tasks they normally do. Unlike a diagnostic test, the tasks
in a competitive test should not be selected because they are
likely to uncover a usability problem or because they probe
some aspect of one of the products.
Even more difficult to establish than lack of bias in task selection is apparent bias. If people who work for the company
that makes one of the products select the tasks, it is difficult to
counter the charge of bias even if there is no bias. This problem is why most organizations will hire an outside company or
consultant to select the tasks and run the test. But often the
consultant doesn't know enough about the product area to be
able to select tasks that are typical for end users. One solution is
to hire an industry expert to select or approve the selection of
tasks. Another is to conduct a survey of end users, asking them
to list the tasks they do.
The wording of the task scenarios can also be a source of bias,
for example, because they describe tasks in the terminology
used by one of the products. The scenarios need to be scrubbed
of biasing terminology.
Finally, the test administrator who interacts with each test
participant must do so without biasing the participants. The interaction in a competitive test must be as minimal as possible.
The test administrator should not provide any guidance in performing tasks and should be careful not to give participants rewarding feedback after task success. If participants are to be told


when they complete a task, it should be done after every completed task for all products. Because of the variability in task times it causes, participants should not be thinking aloud and should
be discouraged from making verbal tangents during the tasks.

Baseline Usability Tests


One of the ways to measure progress in user interface design is
by comparing the results of a test to a usability baseline. Without
a baseline, it can be difficult to interpret quantitative measures
from a test and put them in context. For example, if it takes
a sample of participants 7 minutes to complete a task with an
average of two errors, how does a tester interpret that result?
One way is to compare it to a usability goal for the task. Another
is to compare it to the results of the same task in an earlier
version of the product.
But establishing a baseline of data takes care. Average measures from a diagnostic usability test with a few participants
can be highly variable for two reasons. First, because of the
small number of participants, average scores can be distorted by
a wildcard. Because of this variability, it is best to use a sample
size closer to those from a comparison test than those from a
diagnostic test. Second, the thinking-aloud procedure typically
used in diagnostic tests adds to the variability in performing the
task. It is best not to have participants think aloud in a baseline
test, which makes the data cleaner but also lessens its value as
a diagnostic tool.

Allowing Free Exploration


An important issue in user testing is what the participant does
first. For example, if all users will have some training before they
use the product, the tester might want to provide this training.
There is often a preamble to the first task scenario that puts the
test and the tasks into some context. Most often, the preamble
leads to the first task scenario. Using this procedure immediately
throws the participant into product use. Some testers argue that
this procedure is unrealistic, that in the "real world" people
don't work that way but spend a few minutes exploring the
product before they start doing tasks. Others argue that going
directly to tasks without training or much of a preamble puts
stress on the product to stand on its own, stress that is beneficial
in making the product more usable.
Should testers consider allowing the test participants 5 to
10 minutes of exploration before they begin the task scenarios?
Those in favor of free exploration argue that without it, the product is getting a difficult evaluation and that the testing situation
is not simulating the real use environment, especially for Web-based products. Users must know something about the product
to buy it, or their company might give them some orientation to
it. Those against free exploration argue that it introduces added
variability into the test; some participants will find information
that helps them do the tasks, but others won't find the same
information. Furthermore, nobody really knows what users do
when no one is watching. A usability test session is a constructed
event that does not attempt to simulate every component of the real use environment. Finally, the test is intended to be a difficult evaluation for the product to pass. This debate continues, but most testers do not allow free exploration.

CHALLENGES TO THE VALIDITY OF USABILITY TESTING
For most of its short history, user testing has been remarkably
free from criticism. Part of the reason for this freedom is the
high face validity of user testing, which means that it appears
to measure usability. User testing easily wins converts. When
visitors watch a test for the first time, they think they are seeing
a "real" user spontaneously providing their inner experiences
through their think-aloud protocol. Visitors often conclude that
they are seeing what really happens when no one is there to
watch customers. When a usability problem appears in the performance of a test participant, it is easy to believe that every
user will have that problem.
But some impressions of user testing can be wrong. A test
session is hardly a spontaneous activity. On the contrary, a user
test is a very constructed event. Each task and each word in
each scenario has been carefully chosen for a specific purpose.
And unfortunately, we don't know what really happens when
no one is watching.
In the past 5 years, researchers and practitioners have begun
to ask tough questions about the validity of user testing as part
of a wider examination of all usability evaluation methods. This
skepticism is healthy for the usability profession. Here I discuss
four challenges to validity:
1. How do we evaluate usability testing?
2. Why can't we map usability measures to user interface components?
3. Are we ignoring the operational environment?
4. Why don't usability specialists see the same usability problems?

How Do We Evaluate Usability Testing?


One of the consequences of making a distinction between usability testing and research is that it becomes unclear how to
evaluate the quality and validity of a usability test, especially
a diagnostic test. As I have noted, usability professionals who
write about testing agree that a usability test is not a research
study. Consequently, it is not clear whether the principles of
research design should be applied to a diagnostic usability test.
Principles, such as isolating an independent variable and having
enough test participants to compute a statistical test, do not apply to diagnostic usability testing. The six essential characteristics of user testing described above set the minimum conditions
for a valid usability test but do not provide any further guidance.
For example, are all samples of tasks equal in terms of ensuring
the validity of a test? Are some better than others? Would some
samples be so bad as to invalidate the test and its results? Would
any reasonable sample of tasks uncover the global usability problems? Is a test that misses uncovering a severe usability
problem just imperfect, or is it invalid?
Dumas (1999) explored other ways to judge the validity of
a user test. For example, Skinner (1956) invented a design in
which causality between independent and dependent variables
"was established with only one animal. By turning the independent variable on and off several times with the same animal, he
was able to establish a causal relationship between, for example, a reinforcement schedule and the frequency and variability
of bar pressing or pecking. In some ways, Skinner's method
is similar to having the same usability problem show up many
times both between and within participants in a usability test.
In this analogy, usability problems that repeat would establish a
causal relationship between the presentation of the same tasks
with the same product and the response of the participants.
This relationship is exactly why a tester becomes confident that
problems that repeat are caused by a flawed design. But should
we end there? Should we only fix repeating problems? And what
if, as often happens, some participants don't have the problem?
It is not clear where to draw the repetition line.
Hassenzahl (1999) argued that a usability tester is like a clinician trying to diagnose a psychological illness. An effective tester is one who is good at tying symptoms, that is, usability problems, to a cause: a poor design. In this analogy, a goal for
the profession is to create a diagnostic taxonomy to make problem interpretations more consistent. Gray and Salzman (1998)
and Lund (1998) have made similar points. Until that happens,
however, we are left looking for good clinicians (testers), but
we have little guidance about what makes a valid test.

Why Can't We Map Usability Measures to User Interface Components?
An important challenge to the validity of usability testing, one that Gray and Salzman (1998) said is the most important, is the
difficulty of relating usability test measures to components of
the user interface. Practitioners typically use their intuition and
experience to make such connections. For example, a long task
time along with several errors in performing a task may be attributed to a poorly organized menu structure. Would other
testers make the same connection? Do these two measures always point to the same problem? Do these measures only point
to this one problem? Is the problem restricted to one menu or
several? Are some parts of the menu structure effective?
As we have seen, common practice in test reporting is to
group problems into more general categories. For example, difficulties with several words in an interface might be grouped under a "terminology" or a "jargon" category. Unfortunately, there
is no standardized set of these categories. Each test team can roll
their own categories. This makes the connection from design
component to measures even more difficult to make. Landauer
(1995) urged usability professionals and researchers to link measures such as the variability in task times to specific cognitive
strategies people use to perform tasks. Virzi et al. (1993) compared the results of a performance analysis of objective measures
with the results of a typical think aloud protocol analysis. They
identified many fewer problems using performance analysis. That study and others suggest that many problems identified in
a usability test come from the think-aloud protocol alone. Could
some of these problems be false alarms, which is to say they are
not usability problems at all? Expert review, an inspection evaluation method, has been criticized for proliferating false alarms.
Bailey, Allan, and Raiello (1992) claimed that most of the problems identified by experts are false alarms. But they used the
problems that they identified from user testing as the comparison. If Bailey et al. are correct, most of the problems identified
by user testing also are false alarms. Their study suggests that
the only practice that makes any difference is to fix the one or
two most serious problems found by user testing.
Without a consistent connection between measures and user
interface components, the identification of problems in a user
test looks suspiciously like an ad hoc fishing expedition.

Are We Ignoring the Operational Environment?


Meister (1999) took the human factors profession to task for
largely ignoring the environment within which products and
systems are used (see Fig. 56.3). He asserted that in human
factors, the influence of the environment on the human-technology interaction is critical to the validity of any evaluation. He proposed that human factors researchers have chosen
erroneously to study the interaction of people and technology
largely in a laboratory environment. He noted that "any environment in which phenomena are recreated, other than the one
for which it was intended, is artificial and unnatural" (p. 66).
Although Meister did not address usability testing directly, he
presumably would have the same criticism of the use of testing
laboratories to evaluate product usability.
Those who believe that it is important to work with users
in their operational environment as the usability specialists
gather requirements also believe that at least early prototype
testing should be conducted in the work environment (Beyer &
Holtzblatt, 1997). The assumption these advocates make is that
testing results will be different if the test is done in the work environment rather than in a usability lab. These differences will lead to designs that are less effective if the richness of the work environment is ignored. The proponents of testing in the work environment offer examples to support their belief, but to date there are no research studies that speak to this issue.

[Figure 56.3 is a diagram relating the human-technology interaction and its tasks to the surrounding environment.]
FIGURE 56.3. The scope of human factors. From "Usability testing methods: When does a usability test become a research experiment?" by J. Dumas, 2000, Common Ground, 10. Reprinted with permission.
One could imagine a continuum on which to place the influence of the operational environment on product use. For some
products, such as office productivity tools, it seems unlikely that
the operational environment would influence the usability of a
product. For some other products, such as factory floor operational software, the physical and social environments definitely
influence product use; and then there is a wide range of products that fall in between. For example, would an evaluation of
a design for a clock radio be complete if test participants didn't
have to read the time from across a dark room? Or shut the alarm
off with one hand while lying down in a dark room?
Meister admitted that it is often difficult to create or simulate
the operational environment. The classic case is an accident in
a power plant that happens once in 10 years. One of the reasons the operational environment is not considered more often
in usability evaluation is that it is inconvenient and sometimes
difficult to simulate. When we list convenience as a quality of a
usability lab, we need to keep in mind that for some products,
the lab environment may be insufficient for uncovering all of
the usability problems in products.

Why Don't Usability Specialists See the Same Usability Problems?
Earlier I discussed the fact that usability specialists who viewed
sessions from the same test had little agreement about which
problems they saw and which ones were the most serious
(Jacobsen & John, 1998). There are two additional studies that
also speak to this point (Molich et al., 1998, 2001). These studies
both had the same structure. A number of usability labs were
asked to test the same product. They were given broad instructions about the user population and told that they were to do
a "normal" usability test. In the first study, four labs were included; in the second, there were seven. The results of these
studies were not encouraging. There were many differences in
how the labs went about their testing. It is clear from these studies that there is little commonality in testing methods. But even
with that caveat, one would expect these labs staffed by usability professionals to find the same usability problems. In the first
study, there were 141 problems identified by the four labs. Only
one problem was identified by all of the labs. Ninety-one percent
of the problems were identified by only one lab. In the second
study, there were 310 problems identified by the seven teams.
Again, only one problem was identified by all seven teams, and
75% of the problems were identified by only one team.
Our assumption that usability testing is a good method for finding the important problems quickly has to be questioned by the
results of these studies. It is not clear why there is so little overlap
in problems. Are slight variations in method the cause? Are the
problems really the same but just described differently? We look
to further research to sort out the possibilities.

ADDITIONAL ISSUES
In this final section on usability testing, I discuss five additional issues:
1. How do we evaluate ease of use?
2. How does user testing compare with other evaluation
methods?
3. Is it time to standardize methods?
4. Are there ethical issues in user-based evaluation?
5. Is testing Web-based products different?

How Do We Evaluate Ease of Use?


Usability testing is especially good at assessing initial ease of
learning issues. In many cases, a usability test probes the first
hour or two of use of a product. Testers see this characteristic
as an asset because getting started with a new product is often
a key issue. If users can't get by initial usability barriers, they
may never use the product again, or they may use only a small
part of it.
Longer term usability issues are more difficult to evaluate.
Product developers often would like to know what usability
will be like after users learn how to use a product. Will users
become frustrated by the very affordances that help them learn
the product in the first place? How productive will power users
be after 6 months of use?
Although there is no magic potion that will tell developers
what usability will be like for a new product after 6 months,
there are some techniques that address long-term
concerns:
Repeating the same tasks one or more times during the session: this gets at whether usability problems persist when users see them again (see the sketch after this list).
Repeating the test: a few weeks in between tests provides some estimate of long-term use.
Providing training to participants who will have it when the product is released: establishing a proficiency criterion that participants have to reach before they are tested is a way to control for variations in experience.
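As an illustration of the first technique, the following sketch compares each participant's time on a task between a first and a repeated attempt. The times and the improvement threshold are hypothetical; this is only one plausible way to flag persistent problems, not a method prescribed here.

# Hypothetical sketch: comparing first vs. repeated attempts at the same task
# to see whether a usability problem persists. Times (in seconds) and the
# improvement threshold are invented for illustration.
task_times = {
    "P01": (310, 150),  # (first attempt, repeated attempt)
    "P02": (280, 265),
    "P03": (420, 180),
    "P04": (350, 330),
}

IMPROVEMENT_THRESHOLD = 0.25  # expect at least a 25% drop if the problem was one-time

for participant, (first, repeat) in task_times.items():
    improvement = (first - repeat) / first
    persists = improvement < IMPROVEMENT_THRESHOLD
    status = "problem may persist" if persists else "likely a first-use problem"
    print(f"{participant}: {first}s -> {repeat}s ({improvement:.0%} faster) - {status}")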
Although these techniques sometimes are useful, assessing the long-term ease of use of a new product is difficult with any evaluation method.

How Does Usability Testing Compare With Other Evaluation Methods?

In the early 1990s, there were several research studies that
looked at the ability of user testing to uncover usability problems
and compared testing with other evaluation methods, especially
expert reviews and cognitive walkthroughs (Desurvire, 1994;
Jeffries, Miller, Wharton, & Uyeda, 1991; Karat, Campbell, &
Fiegel, 1992; Nielsen & Phillips, 1993). The evaluation methods together are now called UEMs (usability evaluation methods).
In these studies, testing generally came out quite well in comparison with the other methods. Its strengths were in finding
severe usability problems quickly and finding unique problems,
that is, problems not uncovered by other UEMs.
Jeffries et al. (1991) found that usability testing didn't uncover as many problems as an expert review and that no one expert found more than 40% of the problems. Furthermore, when
the authors segmented the problems by severity, usability testing found the smallest number of the least severe problems and
the expert reviewers found the most. Karat et al. (1992) compared usability testing to two kinds of walkthroughs and found
that testing found more problems and more severe problems.
In addition, usability testing uncovered more unique problems
than walkthroughs. Desurvire (1994) compared usability testing to both expert reviews and walkthroughs and found that
usability testing uncovered the most problems, the most severe
problems, and the most unique problems.
Dumas and Redish (1993), reviewing these studies from a
usability testing perspective, summarized the strengths of usability testing as uncovering more severe problems than the
other methods. Since that time, this clear-cut depiction of usability testing has been challenged. All of these studies and
more were reviewed by Gray and Salzman (1998) and Andre,
Williges, and Hartson (1999) in a meta-analysis of the comparison research. In Gray and Salzman's view, all of the studies are
flawed, being deficient in one or more of five types of validity. Their analysis makes it difficult to be sure what conclusions
to draw from the comparison studies. Andre et al. proposed
three criteria to evaluate UEMs: thoroughness (finding the most
problems), validity (finding the true problems), and reliability
(repeatedly finding the same problems). They found that they
could only compare UEM studies on thoroughness, with inspection methods scoring higher than usability testing. Andre
et al. could not find sufficient data to compare UEMs on validity or reliability. Fu, Salvendy, and Turley (1998) proposed
that usability testing and expert reviews find different kinds of
problems.
As described above, it now appears that the inability of experts and researchers to agree consistently on whether problems are severe makes it difficult to tout usability testing's purported strength at uncovering severe problems quickly.
Even the conclusion that usability testing finds unique problems
is suspect because those problems might be false alarms. Andre
et al. proposed that usability testing be held up as the yardstick against which to compare other UEMs. But the assumption
that usability testing uncovers the true problems has not been
established.
Gray and Salzman's analysis was criticized by usability practitioners (Olson & Moran, 1998). The practitioners were not ready
to abandon their confidence in the conclusions of the comparison studies and continue to apply them to evaluate the products
they develop. To date, no one has shown that any of Gray and
Salzman's or Andre et al.'s criticisms of the lack of validity of
the UEM studies is incorrect. At present, the available research
leaves us in doubt about the advantages and disadvantages of
usability testing relative to other UEMs.


Is It Time to Standardize Methods?


Several standards-setting organizations have included user-based
evaluation as one of the methods they recommend or require for
assessing the usability of products. These efforts usually take a
long time to gestate and their recommendations are sometimes
not up to date, but the trends are often indicative of a method's
acceptance in professional circles.
The International Organization for Standardization (ISO) standard ISO 9241, "Ergonomic requirements for office work with visual display terminals (VDTs)," describes the ergonomic requirements for the use of visual display terminals for office tasks. Part 11 provides the definition of usability, explains how to identify the information that needs to be taken into account when evaluating usability, and describes required measures of usability. Part 11 also includes an explanation of how the usability of a product can be evaluated as part of a quality system. It explains how measures of user performance and satisfaction, when gathered in methods such as usability testing, can be used to measure product usability.
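Part 11 frames usability in terms of effectiveness, efficiency, and satisfaction. The short sketch below shows one common way such measures can be computed from test data; the task results and the 1-7 rating scale are invented, and the standard itself does not prescribe these exact formulas.

# Hypothetical sketch of the three kinds of measures ISO 9241-11 associates with
# usability: effectiveness, efficiency, and satisfaction. The task results and
# the 1-7 satisfaction scale are invented; the standard does not mandate these
# exact formulas.

# One record per participant: task completed?, time on task (s), satisfaction rating (1-7)
results = [
    (True, 240, 6),
    (True, 310, 5),
    (False, 600, 2),
    (True, 280, 6),
]

completed = [r for r in results if r[0]]

effectiveness = len(completed) / len(results)                  # completion rate
efficiency = sum(t for _, t, _ in completed) / len(completed)  # mean time on successful tasks
satisfaction = sum(s for _, _, s in results) / len(results)    # mean rating

print(f"Effectiveness (completion rate): {effectiveness:.0%}")
print(f"Efficiency (mean time on successful tasks): {efficiency:.0f} s")
print(f"Satisfaction (mean rating, 1-7): {satisfaction:.1f}")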
ISO/DIS 13407, "Human-centered design processes for interactive systems," provides guidance on human-centered design
including user-based evaluation throughout the life cycle of interactive systems. It also provides guidance on sources of information and standards relevant to the human-centered approach.
It describes human-centered design as a multidisciplinary activity, which incorporates human factors and ergonomics methods
such as user testing. These methods can enhance the effectiveness and efficiency of working conditions and counteract
possible adverse effects of use on human health, safety, and
performance.
One of the most interesting efforts to promote usability methods has been conducted by the U.S. Food and Drug Administration (FDA), specifically the Office of Health and Industrial
Programs, which approves new medical devices. In a report titled "Do It by Design" (http://www.fda.gov/cdrh/humfac/doit.
html), the FDA described what it considers best practices in
human factors methods that can be used to design and evaluate
devices. Usability testing plays a prominent part in that description. The FDA stops short of requiring specific methods but
does require that device manufacturers prove that they have an
established human factors program. The FDA effort is an example of the U.S. Government's relatively recent but enthusiastic
interest in usability (http://www.usability.gov).
The most relevant standards-setting effort to those who conduct user-based evaluations is the National Institute of Standards and Technology's (NIST) Industry Usability Reporting
(IUSR) Project. This project has been underway since 1997.
It consists of more than 50 representatives of industry, government, and consulting who are interested in developing
standardized methods and reporting formats for quantifying usability (http://zing.ncsl.nist.gov/iusr/).
One of the purposes of the NIST IUSR project is to provide
mechanisms for dialogue between large customers, who would
like to have usability test data factored into the procurement
decision for buying software, and vendors, who may have usability data available.

NIST worked with a committee of usability experts from around the world to develop a format for usability test reports, called the common industry format (CIF). The goal of the CIF is to facilitate communication about product usability between large companies who want to buy software and providers who want to sell it. The CIF provides a way to evaluate the usability of the products buyers are considering on a common basis. It specifies what should go into a report that conforms to the CIF, including what is to be included about the test method, the analysis of data, and the conclusions that can be drawn from the analysis. The CIF is intended to be written by usability specialists and read by usability specialists. One of its assumptions is that, given the appropriate data specified by the CIF, a usability specialist can measure the usability of a product their company is considering buying.
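As a rough illustration of the kinds of content such a report covers, the sketch below captures the method, data analysis, and conclusions as a simple data structure. The class and field names are my own shorthand, not the normative section headings of the standard.

# Hypothetical outline of the kinds of content a CIF-style report covers.
# The class and field names are illustrative shorthand; they are not the
# normative section headings of ANSI/NCITS 354-2001.
from dataclasses import dataclass
from typing import List

@dataclass
class UsabilityTestReport:
    product: str
    # Test method: who was tested, what they did, and in what setting
    participants: str
    tasks: List[str]
    test_environment: str
    # Analysis of the data: performance and satisfaction measures
    effectiveness_results: str
    efficiency_results: str
    satisfaction_results: str
    # Conclusions that can be drawn from the analysis
    conclusions: str = ""

report = UsabilityTestReport(
    product="Example accounting package (hypothetical)",
    participants="8 experienced bookkeepers",
    tasks=["Create an invoice", "Run a monthly report"],
    test_environment="Standard office PC in a usability lab",
    effectiveness_results="7 of 8 participants completed both tasks",
    efficiency_results="Mean time of 6.5 minutes per task",
    satisfaction_results="Mean rating of 5.8 on a 7-point scale",
    conclusions="Usable for invoicing; the reporting task needs redesign.",
)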
The CIF is not intended to apply to all usability tests. It applies
to a summative test done late in the development process to
measure the usability of a software product, not to diagnostic
usability tests conducted earlier in development.
The American National Standards Institute (ANSI) has made the CIF one of its standards (ANSI/NCITS 354-2001). The CIF document is available from http://techstreet.com. It is difficult to know how this standard will be used, but it could mean
that in the near future vendors who are selling products to large
companies could be required to submit a test report in CIF
format.

Are There Ethical Issues in User Testing?


Every organization that does user testing needs a set of policies
and procedures for the treatment of test participants. Most organizations with policies use the federal government's or the American Psychological Association's policies for the treatment of participants in research. At the heart of these policies are the concepts of
informed consent and minimal risk. Minimal risk means that "the
probability and magnitude of harm or discomfort anticipated in
the test are not greater, in and of themselves, than those ordinarily encountered in daily life or during the performance of
routine physical or psychological examination or tests" (Notice of Proposed Rule Making, Federal Register, 1988, Vol. 53, No. 218, p. 45663). Most
usability tests do not put participants at more than minimal risk.
If the test director feels that there may be more than minimal
risk, he or she should follow the procedures described in the
Notice of Proposed Rule Making in the Federal Register, 1988,
Vol. 53, No. 218, pp. 45661-45682.
Even if the test does not expose participants to more than
minimal risk, testers should have participants read and sign an
informed consent form, which should describe the purpose of
the test; what will happen during the test, including the recording of the session, what will be done with the recording, and
who will be watching the session; and the participants' right to
ask questions and withdraw from the test at any time. Participants need to have the chance to give their consent voluntarily.
For most tests, that means giving them time to read the form
and asking them to sign it as an indication of their acceptance
of what is in the form. For an excellent discussion of how to create and use consent forms, see Waters, Carswell, Stephens, and Selwitz (2001).
A special situation in which voluntariness may be in question arises when testers sample participants from their
own organizations. The participants have a right to know who
will be watching the session and what will be done with the
videotape. If the participants' bosses or other senior members
of the organization will be watching the sessions, it is difficult to
determine when the consent is voluntary. Withdrawing from the
session may be negatively perceived. The test director needs to
be especially careful in this case to protect participants' rights to
give voluntary consent. The same issue arises when the results
of a test with internal participants are shown in a highlight tape.
In that case, the participants need to know before the test that
the tape of the session might be viewed by people beyond the
development team. Test directors should resist making a highlight tape of any test done with internal participants. If that can't
be avoided, the person who makes the highlight tape needs to
be careful about showing segments of tape that place the participant in a negative light, even if only in the eyes of the participant.
The names of test participants also need to be kept in
confidence for all tests. Only the test director should be able
to match data with the name of a participant. The participants'
names should not be written on data forms or on videotapes.
Use numbers or some other code to match the participant
with their data. It is the test director's responsibility to refuse
to match names with data, especially when the participants are
internal employees.
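A minimal sketch of such a coding scheme, assuming a simple file-based setup: data records carry only a participant code, and the single mapping from names to codes is kept in a separate file under the test director's control. The file names and code format are invented.

# Minimal sketch of coding participant identities, assuming a simple CSV-based
# logging setup. File names and the code format (P01, P02, ...) are invented;
# the point is that data records never carry names.
import csv

participants = ["Alice Example", "Bob Example"]  # hypothetical names

# The only name-to-code mapping, stored separately and accessible to the
# test director alone (e.g., on an encrypted drive).
key = {name: f"P{i + 1:02d}" for i, name in enumerate(participants)}

with open("participant_key_PRIVATE.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(key.items())

# Data forms and logs use only the code.
with open("session_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["participant_code", "task", "time_seconds"])
    writer.writerow([key["Alice Example"], "Task 1", 240])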
This discussion should make it clear that it may be difficult to
interpret subjective measures of usability when the participants
are internal employees. Their incentive to give positive ratings
to the product may be increased when they believe that people
from their company may be able to match their rating with their
name.

Is Testing Web-Based Products Different?


There is nothing fundamentally different about testing Web
products, but the logistics of such tests can be a challenge
(Grouse, Jean-Pierre, Miller, & Goff, 1999). Often the users
of Web-based products are geographically dispersed and may
be more heterogeneous in their characteristics than users of
other technologies. The most important challenge in testing
these products is the speed with which they are developed
(Wichansky, 2000). Unlike products with traditional cyclic development processes, Web products often do not have released
versions. They are changed on a weekly, if not daily, basis. For
testing, this means gaining some control over the product being
tested. It needs to be stable while it is tested, not a moving target.
With control comes the pressure to produce results quickly.
Conducting a test in 8 to 12 weeks is no longer possible in
fast-paced development environments. Testing in 1 or 2 weeks
is more often the norm now. Testing with such speed is only
possible in environments where the validity of testing is not
questioned and the test team is experienced.


The Future of Usability Testing


Usability testing is clearly the most complex usability evaluation method, and we are only beginning to understand the
implications of that complexity. It appears that usability testing has entered into a new phase in which its strengths and
weaknesses are being seriously debated, although it remains
very popular and new usability labs continue to open. Before
1995, the validity of testing was seldom challenged. The recent research has opened up a healthy debate about the assumptions underlying the method. We can never go back to our
earlier innocence about this method, which looks so simple
in execution but whose subtleties we are only beginning to
understand.

WHICH USER-BASED METHOD TO USE?


Deciding which of the user-based evaluation methods to use
should be done in the context of the strengths and weaknesses of all of the usability inspection methods discussed in Chapter 57. Among the user-based methods, direct or video
observation is useful in special situations. It allows usability specialists to observe populations of users who cannot otherwise
be seen or who can only be observed through the medium of
videotape. Questionnaires are a useful way to evaluate a broad
sample of users, to measure the usability of a product that has
been used by the same people over a long period of time, and
to sample repeatedly the same user population. The best questionnaires also have the potential to allow usability comparisons
across products and, perhaps, to provide an absolute measure of
usability. Usability testing can be used throughout the product
development cycle to diagnose usability problems. Of all of the evaluation methods, its findings have the most credibility with developers. As currently practiced, tests can be conducted quickly and allow retesting to check whether solutions to usability problems are effective. Using testing to compare products or to provide an absolute measure of usability requires more time and resources, and testers who have knowledge of research design and statistics.

References
Abelow, D. (1992). Could usability testing become a built-in product
feature? Common Ground, 2, 1-2.
Andre, T., Williges, R., & Hartson, H. (1999). The effectiveness of usability evaluation methods: Determining the appropriate criteria.
Proceedings of the Human Factors and Ergonomics Society, 43rd
Annual Meeting (pp. 1090-1094). Santa Monica, CA: Human Factors and Ergonomics Society.
Baber, C., & Stanton, N. (1996). Observation as a technique for usability
evaluation. In P. Jordan, B. Thomas, B. Weerdmeester, & I. McClelland
(Eds.), Usability evaluation in industry (pp. 85-94). London: Taylor
& Francis.
Bailey, R. W. (1993). Performance vs. preference. Proceedings of the
Human Factors and Ergonomics Society, 37th Annual Meeting
(pp. 282-286). Santa Monica, CA: Human Factors and Ergonomics
Society.
Bailey, R. W., Allan, R. W., & Raiello, P. (1992). Usability testing vs.
heuristic evaluation: A head-to-head comparison. Proceedings of the
Human Factors and Ergonomics Society, 36th Annual Meeting
(pp. 409-413). Santa Monica, CA: Human Factors and Ergonomics
Society.
Barker, R. T., & Biers, D. W. (1994). Software usability testing: Do user
self-consciousness and the laboratory environment make any difference? Proceedings of the Human Factors Society, 38th Annual
Meeting (pp. 1131-1134). Santa Monica, CA: Human Factors and
Ergonomics Society.
Bauersfeld, K., & Halgren, S. (1996). "You've got three days!" Case
studies in field techniques for the time-challenged. In D. Wixon
& J. Ramey (Eds.), Field methods casebook for software design
(pp. 177-196). New York: John Wiley.
Beyer, H., & Holtzblatt, K. (1997). Contextual design: Designing
customer-centered systems. San Francisco: Morgan Kaufmann.
Bias, R. (1994). The pluralistic usability walkthrough: Coordinated empathies. In J. Nielsen & R. Mack (Eds.), Usability inspection methods
(pp. 63-76). New York: John Wiley.
Boren, M., & Ramey, J. (2000, September). Thinking aloud: Reconciling
theory and practice. IEEE Transactions on Professional Communication, 1-23.

Bowers, V., & Snyder, H. (1990). Concurrent versus retrospective verbal protocols for comparing window usability. Proceedings of the
Human Factors Society, 34th Annual Meeting (pp. 1270-1274).
Santa Monica, CA: Human Factors and Ergonomics Society.
Bradburn, N. (1983). Response effects. In R. Rossi, M. Wright, & J.
Anderson (Eds.), The handbook of survey research (pp. 289-328).
New York: Academic Press.
Branaghan, R. (1997). Ten tips for selecting usability test participants.
Common Ground, 7, 3-6.
Branaghan, R. (1998). Tasks for testing documentation usability.
Common Ground, 8, 10-11.
Brooke, J. (1996). SUS: A quick and dirty usability scale. In P. Jordan, B.
Thomas, B. Weerdmeester, & I. McClelland (Eds.), Usability evaluation in industry (pp. 189-194). London: Taylor & Francis.
Cantani, M. B., & Biers, D. W. (1998). Usability evaluation and prototype fidelity: Users and usability professionals. Proceedings of the
Human Factors Society, 42nd Annual Meeting (pp. 1331-1335).
Santa Monica, CA: Human Factors and Ergonomics Society.
Chignell, M. (1990). A taxonomy of user interface terminology. SIGCHI
Bulletin, 21, 27-34.
Chin, J. P., Diehl, V. A., & Norman, K. L. (1988). Development of an
instrument measuring user satisfaction of the human-computer interface. Proceedings of Human Factors in Computing Systems '88,
213-218.
Desurvire, H. W. (1994). Faster, cheaper! Are usability inspection methods as effective as empirical testing? In J. Nielsen & R. Mack (Eds.),
Usability inspection methods (pp. 173-202). New York: John
Wiley.
DeVries, C., Hartevelt, M., & Oosterholt, R. (1996). Private camera conversation: A new method for eliciting user responses. In P. Jordan,
B. Thomas, B. Weerdmeester, & I. McClelland (Eds.), Usability evaluation in industry (pp. 147-156). London: Taylor & Francis.
Dobroth, K. (1999, May). Practical guidance for conducting usability
tests of speech applications. Paper presented at the annual meeting
of the American Voice I/O Society (AVIOS). San Diego, CA.
Doll, W., & Torkzadeh, G. (1988). The measurement of end-user computing satisfaction. MIS Quarterly, 12, 259-374.

Dumas, J. (1998a). Usability testing methods: Using test participants as their own controls. Common Ground, 8, 3-5.
Dumas, J. (1998b). Usability testing methods: Subjective measures, Part I: Creating effective questions and answers. Common Ground, 8, 5-10.
Dumas, J. (1998c). Usability testing methods: Subjective measures, Part II: Measuring attitudes and opinions. Common Ground, 8, 4-8.
Dumas, J. (1999). Usability testing methods: When does a usability test
become a research experiment? Common Ground, 9, 1-5.
Dumas, J. (2000). Usability testing methods: The fidelity of the testing
environment. Common Ground, 10, 3-5.
Dumas, J. (2001). Usability testing methods: Think-aloud protocols. In
R. Branghan (Ed.), Design by people for people: Essays on usability.
Chicago: Usability Professionals' Association.
Dumas, J., & Redish, G. (1993). A practical guide to usability testing.
NJ: Ablex.
Dumas, J., & Redish, G. (1999). A practical guide to usability testing
(Rev. ed.). London: Intellect Books.
Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.
Fisher, R. A., & Yates, F. (1963). Statistical tables for biological, agricultural and medical research. Edinburgh, Scotland: Oliver & Boyd.
Frokjaer, E., Hertzum, M., & Hornbaek, K. (2000). Measuring usability: Are effectiveness, efficiency, and satisfaction really correlated?
Proceedings of Human Factors in Computing Systems '2000,
45-52.
Fu, L., Salvendy, G., & Turley, L. (1998). Who finds what in usability
evaluation. Proceedings of the Human Factors and Ergonomics
Society, 42nd Annual Meeting (pp. 1341-1345). Santa Monica, CA:
Human Factors and Ergonomics Society.
Gaba, D. M. (1994). Human performance in dynamic medical domains.
In M. S. Bogner (Ed.), Human error in medicine (pp. 197-224).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Gage, N., & Berliner, D. (1991). Educational psychology (5th ed.).
New York: Houghton Mifflin.
Goldberg, J. H. (2000). Eye movement-based interface evaluation: What
can and cannot be assessed? Proceedings of the IEA 2000/HFES
2000 Congress (44th Annual Meeting of the Human Factors and
Ergonomics Society) (pp. 625-628). Santa Monica, CA: Human Factors and Ergonomics Society.
Gray, W., & Salzman, M. (1998). Damaged merchandise? A review of experiments that compare usability methods [Special Issue]. Human-Computer Interaction, 13, 203-261.
Grouse, E., Jean-Pierre, S., Miller, D., & Goff, R. (1999). Applying usability methods to a large intranet site. Proceedings of the Human
Factors and Ergonomics Society, 43rd Annual Meeting (pp. 782-786). Santa Monica, CA: Human Factors and Ergonomics Society.
Ground, C., & Ensing, A. (1999). Apple pie a-la-mode: Combining subjective and performance data in human-computer interaction tasks.
Proceedings of the Human Factors and Ergonomics Society, 43rd
Annual Meeting (pp. 1085-1089). Santa Monica, CA: Human Factors and Ergonomics Society.
Hackman, G. S., & Biers, D. W. (1992). Team usability testing: Are two
heads better than one? Proceedings of the Human Factors Society,
36th Annual Meeting (pp. 1205-1209). Santa Monica, CA: Human
Factors and Ergonomics Society.
Hartson, H. R., Castillo, J. C., Kelso, J., Kamler, J., & Neale, W. C. (1996).
Remote evaluation: The network as an extension of the usability
laboratory. Proceedings of Human Factors in Computing Systems
'96, 228-235.
Hassenzahl, M. (1999). Usability engineers as clinicians. Common
Ground, 9, 12-13.

Hughes, M. (1999). Rigor in usability testing. Technical Communication, 46, 488-494.


Igbaria, M., & Parasuraman, S. (1991). Attitudes towards microcomputers: Development and construct validation of a measure. International Journal of Man-Machine Studies, 34, 553-573.
Jacobsen, N., & John, B. (1998). The evaluator effect in usability
studies: Problem detection and severity judgments. Proceedings of
the Human Factors and Ergonomics Society, 42nd Annual Meeting
(pp. 1336-1340). Santa Monica, CA: Human Factors and Ergonomics
Society.
Jeffries, R., Miller, J., Wharton, C., & Uyeda, K. (1991). User interface
evaluation in the real world: A comparison of four techniques. Proceedings of Human Factors in Computing Systems '91, 119-124.
Kantner, L. (2001a). Following a fast-moving target: Recording user
behavior in Web usability testing. In R. Branaghan (Ed.), Design
by people for people: Essays on usability (pp. 235-244). Chicago:
Usability Professional's Association.
Kantner, L. (2001b). Assessing Web site usability from server log files. In
R. Branaghan (Ed.), Design by people for people: Essays on usability
(pp. 245-261). Chicago: Usability Professional's Association.
Karat, C. M., Campbell, R., & Fiegel, T. (1992). Comparison of empirical
testing and walk-through methods in user-interface evaluation. Proceedings of Human Factors in Computing Systems '92, 397-404.
Kennedy, S. (1989). Using video in the BNR usability lab. SIGCHI
Bulletin, 21, 92-95.
Kirakowski, J. (1996). The software usability measurement inventory (SUMI): Background and usage. In P. Jordan, B. Thomas, B.
Weerdmeester, & I. McClelland (Eds.), Usability evaluation in industry (pp. 169-177). London: Taylor & Francis.
Kirakowski, J., & Corbett, M. (1988). Measuring user satisfaction. In D.
Jones & R. Winder (Eds.), People and computers (Vol. IV, pp. 189-217). Cambridge, England: Cambridge University Press.
Landauer, T. K. (1995). The trouble with computers. Cambridge, MA:
MIT Press.
Landay, J. A., & Myers, B. (1995). Interactive sketching for the early
stages of user interface design. Proceedings of Human Factors in
Computing Systems '95, 43-50.
Law, C. M., & Vanderheiden, G. C. (2000). Reducing sample sizes
when user testing with people who have, and who are simulating, disabilities: Experiences with blindness and public information kiosks. Proceedings of the IEA 2000/HFES 2000 Congress,
4, 157-160. Santa Monica, CA: Human Factors and Ergonomics
Society.
Ledgard, H. (1982). Evaluating text editors. Proceedings of Human
Factors in Computer Systems, 135-156.
Lesaigle, E. M., & Biers, D. W. (2000). Effect of type of information on
real-time usability evaluation: Implications for remote usability testing. Proceedings of the IEA 2000/HFES 2000 Congress, 6, 585-588.
Santa Monica, CA: Human Factors and Ergonomics Society.
Lewis, J. (1991). Psychometric evaluation of an after-scenario questionnaire for computer usability studies: The ASQ. SIGCHI Bulletin, 23,
78-81.
Lewis, J. (1994). Sample size for usability studies: Additional considerations. Human Factors, 36, 368-378.
Lewis, J. R. (1995). IBM computer usability satisfaction questionnaires:
Psychometric evaluation and instructions for use. International
Journal of Human-Computer Interaction, 7, 57-78.
Lister, M. (2001). Usability testing software for the Internet. Proceedings
of Human Factors in Computing Systems 2001, 3, 17-18.
Lund, A. M. (1998). The need for a standardized set of usability metrics.
Proceedings of the Human Factors and Ergonomics Society, 42nd
Annual Meeting (pp. 688-691). Santa Monica, CA: Human Factors
and Ergonomics Society.


Meister, D. (1999). The history of human factors and ergonomics. Mahwah, NJ: Lawrence Erlbaum Associates.
Mitropoulos-Rundus, D., & Muzak, J. (1997). How to design and conduct
a consumer in-home usability test. Common Ground, 7, 10-12.
Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D., & Kirakowski, J. (1998). Comparative evaluation of usability tests. Proceedings of the Usability Professionals' Association (pp. 1-12).
Dallas, TX: Usability Professionals' Association.
Molich, R., Kindlund, E., Seeley, J., Norman, K., Kaasgaard, K.,
Karyukina, B., Schmidt, L., Ede, M., van Oel, W., & Kahmann, R.
(2002). Comparative usability evaluation. In press.
Nielsen, J. (1992). Finding usability problems through heuristic evaluation. Proceedings of Human Factors in Computing Systems '92
(pp. 373-380).
Nielsen, J., & Phillips, V. L. (1993). Estimating the relative usability of two interfaces: Heuristic, formal, and empirical methods compared. Proceedings of the Association for Computing Machinery
INTERCHI '93 Conference on Human Factors in Computing
Systems (pp. 214-221). New York: ACM Press.
Olson, G., & Moran, T. (1998). Damaged merchandise? A review of experiments that compare usability methods [Special Issue]. Human-Computer Interaction, 13, 203-261.
Orne, M. (1969). Demand characteristics and the concept of quasicontrols. In R. Rosenthal & R. Rosnow (Eds.), Artifact in behavioral
research (pp. 143-179). New York: Academic Press.
Perkins, R. (2001). Remote usability evaluation over the Internet. In R.
Branaghan (Ed.), Design by people for people: Essays on usability
(pp. 153-162). Chicago: Usability Professional's Association.
Philips, B., & Dumas, J. (1990). Usability testing: Functional requirements for data logging software. Proceedings of the Human Factors
Society, 34th Annual Meeting (pp. 295-299). Santa Monica, CA:
Human Factors and Ergonomics Society.
Rubin, J. (1994). Handbook of usability testing. New York: John Wiley.
Scholtz, J., & Bouchette, D. (1995). Usability testing and group-based
software: Lessons from the field. Common Ground, 5, 1-11.
Shneiderman, B. (1987). Designing the user interface: Strategies for
effective human computer interaction. Reading, MA: AddisonWesley.
Shneiderman, B. (1992). Designing the user interface: Strategies for
effective human computer interaction (2nd ed.). Reading, MA:
Addison-Wesley.
Shneiderman, B. (1997). Designing the user interface: Strategies for effective human computer interaction (3rd ed.). Reading, MA: Addison-Wesley.
Skinner, B. F. (1956). A case history in scientific method. American
Psychologist, 11, 221-233.
Spenkelink, G., Beuijen, K., & Brok, J. (1993). An instrument for measurement of the visual quality of displays. Behaviour and Information
Technology, 12, 249-260.
Thomas, B. (1996). Quick and dirty usability tests. In P. Jordan, B.
Thomas, B. Weerdmeester, & I. McClelland (Eds.), Usability evaluation in industry (pp. 107-114). London: Taylor & Francis.
Virzi, R. A. (1990). Streamlining the design process: Running fewer
subjects. Proceedings of the Human Factors Society, 34th
Annual Meeting (pp. 291-294). Santa Monica, CA: Human Factors
and Ergonomics Society.
Virzi, R. A. (1992). Refining the test phase of usability evaluation: How
many subjects is enough? Human Factors, 34, 457-468.
Virzi, R. A., Sokolov, J. L., & Karis, D. (1996). Usability problem identification using both low and high fidelity prototypes. Proceedings of
Human Factors in Computing Systems '96, 236-243.
Virzi, R. A., Sorce, J. F., & Herbert, L. B. (1993). A comparison of three
usability evaluation methods: Heuristic, think-aloud, and performance testing. Proceedings of the Human Factors and Ergonomics
Society, 37th Annual Meeting, 309-313.
Vora, P. (1994). Using teaching methods for usability evaluations.
Common Ground, 4, 5-9.
Waters, S., Carswell, M., Stephens, R., & Selwitz, A. (2001). Research
ethics meets usability testing. Ergonomics in Design, 9, 14-20.
Wichansky, A. (2000). Usability testing in 2000 and beyond. Ergonomics,
43, 998-1006.
Wiklund, M., Dumas, J., & Thurrott, C. (1992). Does the fidelity
of software prototypes affect the perception of usability? Proceedings of the Human Factors Society, 36th Annual Meeting
(pp. 1207-1212). Santa Monica, CA: Human Factors and Ergonomics
Society.
Wilson, C. E., & Coyne, K. P. (2001). Tracking usability issues: To bug
or not to bug? Interactions, 8, 15-19.
Wolf, C. G. (1989). The role of laboratory experiments in HCI: Help,
hindrance or ho-hum? Proceedings of Human Factors in Computing Systems '89, 265-268.
Young, R., & Barnard, P. (1987). The use of scenarios in HCI research:
Turbo charging the tortoise of cumulative science. Proceedings of
Human Factors in Computing Systems '87, 291-296.
