
Using Threads in Interactive Systems: A Case Study

Carl Hauser, Christian Jacobi, Marvin Theimer, Brent Welch and Mark Weiser

Computer Science Laboratory


Xerox PARC
3333 Coyote Hill Road
Palo Alto, California 94304

Abstract. We describe the results of examining two large research and commercial systems for the ways that they use threads. We used three methods: analysis of macroscopic thread statistics, analysis of the microsecond spacing between thread events, and reading the implementation code. We identify ten different paradigms of thread usage: defer work, general pumps, slack processes, sleepers, one-shots, deadlock avoidance, rejuvenation, serializers, encapsulated fork and exploiting parallelism. While some, like defer work, are well known, others have not been previously described. Most of the paradigms cause few problems for programmers and help keep the resulting system implementation understandable. The slack process paradigm is both particularly effective in improving system performance and particularly difficult to make work well. We observe that thread priorities are difficult to use and may interfere in unanticipated ways with other thread primitives and paradigms. Finally, we glean from the practices in this code several possible future research topics in the area of thread abstractions.

We thank all the hundreds of Cedar and GVX programmers who made this work possible. Alan Demers created the PCR thread package and participated in many discussions regarding thread primitives and scheduling. John Corwin and Chris Uhler assisted us in building and using instrumented versions of GVX. Portions of this work were supported by Xerox, and by ARPA under contract DABT63-91-C-0027.

1. Introduction

Threads sharing an address space are becoming more widely available in popular operating systems such as Solaris 2.x, OS/2 2.x, Windows NT and SysVR4 [Powell91][Custer93]. Xerox PARC's Cedar programming environment and Xerox's STAR, ViewPoint, GlobalView for X Windows and DocuPrint products have used lightweight threads for over 10 years [Smith82][Swinehart86]. This paper reports on an inspection and analysis of the code developed at Xerox in an attempt to detect common paradigms and common mistakes of programming with threads. Our analysis is both static, from reading many lines of ancient and modern modules, and dynamic, using a number of tools we built for detailed thread inspection. We distinguish two systems that developed independently, although most of our remarks apply to both. One, originally a research system but now underlying a number of products as well, we call Cedar. The other, a product system originally derived from the same base as Cedar but relatively unconnected to it for over ten years, we call GVX.

We believe that the systems we examined are the largest and longest-used thread-based interactive systems in everyday use in the world. They are both in use and under continual development. They contain approximately 2.5 million lines of code, in 10,000 modules written by hundreds of people, declaring more than 1000 monitors and monitored record types and over 300 condition variables. They mostly run as very large (working sets of 10's of megabytes) shared-memory, multi-application systems. They have been ported to a variety of processors and multiprocessors, including the .25 to 5 MIPS Xerox D-machines in the 1980's and 5 to 100 MIPS processors of the early 1990's. Our examination of them reveals the kinds of thread facilities and practices that the programmers of these systems found useful and also the kinds of thread programming mistakes that even this experienced community still makes.

Our analysis is subject to two unavoidable biases. First, the programmers of these systems may not be representative—for instance, one community consisted primarily of PhD researchers. Second, any group of people develops habits of work—idioms—that may be unrelated to best practice, but are simply "how we do it here". Although our report draws on code written by two relatively independent communities, comparing and contrasting the two is beyond the scope of this paper.
We examined the systems running on the Portable Common Runtime on top of Unix [Weiser89]. PCR provides the usual primitives: threads sharing a single address space, monitor locks, condition variables, fork and join. Section 2 describes the primitives in more detail.

Although these systems do run on multiprocessors, this paper emphasizes the role of threads in program structuring rather than how they are used to exploit multiprocessors [Owicki89].

This paper is not, for the most part, a statistical analysis of thread behavior. We focus mostly on the microscopic view of individual paradigms and critical path performance. For example, the time between when a key is pressed and the corresponding glyph is echoed to a window is very important to the usability of these systems. Developing an understanding of how individual threads are interacting in accomplishing this task helps to understand the observed performance.

Information for this paper came from three sources. An initial analysis of the dynamic behavior of our systems in terms of macroscopic thread statistics is presented in Section 3. A more detailed analysis of dynamic behavior was obtained from microsecond-resolution information gathered about thread events and scheduling events in the Unix kernel. The kinds of thread events we examined included forks, yields, (thread) scheduler switches, monitor lock entries, and condition variable waits. Finally, to develop a database of static uses of threads we used grep to locate all uses of thread primitives and then read the surrounding code. This reading led us to further searching for uses of modules that provide specialized access to threads (for example, the Cedar package PeriodicalFork). The understanding gained from these three information sources eventually led to the classifications of thread paradigms described in Section 4.

Sections 5 and 6 present some engineering lessons—both for implementors using threads and for implementors of thread systems—from this study. Our conclusions and some suggestions for future work are given in Section 7.

2. Thread model

Lampson and Redell describe the Mesa language's thread model and provide rationale for many of the design choices [Lampson80]. Here, we summarize the salient features as used in our systems.

The Mesa thread model supports multiple, light-weight, pre-emptively scheduled threads that share an address space. The FORK operation creates a new thread to carry out the FORK's procedure-invocation argument. FORK returns a thread value. The JOIN operation on a thread value returns the value returned by the corresponding FORK's procedure invocation. A thread may be JOINed at most once. If a thread will not be JOINed it should be DETACHed, which tells the thread implementation that it can recover the resources of the thread when it terminates.

The language provides monitors and condition variables for synchronizing thread activities. A monitor is a set of procedures, or module, that share a mutual exclusion lock, or mutex. The mutex protects any data managed by the module by ensuring that only a single thread is executing within the module at any instant. Other threads wanting to enter the monitor are enqueued on the mutex. The Mesa compiler automatically inserts locking code into monitored procedures. A variant on this scheme, associating locks with data structures instead of with modules, is occasionally used in order to obtain finer grain locking.

Condition variables (CVs) give more explicit control of thread scheduling. Each CV represents a state of the module's data structures (a condition) and a queue of threads waiting for that condition to become true. A thread uses the WAIT operation on a CV if it has to wait until the condition holds. WAIT operations may time out depending on the timeout interval associated with the CV. A thread uses NOTIFY or BROADCAST to signal waiting threads that the condition has been achieved. The compiler enforces the rule that CV operations are only invoked with the monitor lock held. The WAIT operation atomically releases the monitor lock and adds its calling thread to the CV's wait queue. NOTIFY causes a single thread that is waiting on the CV's wait queue to become runnable—exactly one waiter wakens behavior. (Note that some thread packages define their analog of NOTIFY to have at least one waiter wakens behavior [Birrell91].) BROADCAST causes all threads that are waiting on the CV to become runnable. In either case, threads must compete for the monitor's mutex before reentering the monitor.

Unlike the monitors originally described by Hoare [Hoare74], the Mesa thread model does not guarantee that the condition associated with a CV is satisfied when a WAIT completes. If BROADCAST is used, for example, a different thread might acquire the monitor lock first and change the state of the program. Therefore a thread is responsible for rechecking the condition after each WAIT. Thus, the prototypical use of WAIT is inside a WHILE loop that checks the condition, not inside an IF statement that would only check the condition once. Programs that obey the "WAIT only in a loop" convention are insensitive to whether NOTIFY has at least one waiter wakens behavior or exactly one waiter wakens behavior as described above. Indeed, under this convention BROADCAST can be substituted for NOTIFY without affecting program correctness, so NOTIFY is just a performance hint.
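These conventions translate almost directly into Java, whose built-in monitors are Mesa-style. The following minimal sketch is ours (the class and names are invented, not code from Cedar or GVX) and shows FORK/JOIN together with a WAIT that rechecks its condition in a loop:

    // A minimal Mesa-style monitor in Java: an item count protected by
    // the object's lock, following the "WAIT only in a loop" convention.
    public class ItemMonitor {
        private int items = 0;                // the monitored data

        // Enter the monitor, change the state, and signal the condition.
        public synchronized void put() {
            items++;
            notify();   // like NOTIFY: a hint; notifyAll() (BROADCAST) is also correct
        }

        // WAIT sits inside a WHILE that rechecks the condition, never an IF,
        // because another thread may run first and consume the item.
        public synchronized void take() throws InterruptedException {
            while (items == 0) {
                wait(); // atomically releases the lock; recompetes for it on wakeup
            }
            items--;
        }

        public static void main(String[] args) throws InterruptedException {
            ItemMonitor m = new ItemMonitor();
            Thread consumer = new Thread(() -> {      // FORK
                try {
                    m.take();
                    System.out.println("consumed one item");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumer.start();
            m.put();
            consumer.join();                          // JOIN
        }
    }

Because every waiter rechecks its condition, notifyAll() could be substituted for notify() without affecting correctness, which is exactly the sense in which NOTIFY is only a performance hint.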
Threads have priorities that affect the scheduler. The scheduler runs the highest priority runnable thread, and if there are several runnable threads at the highest priority then round-robin is used among them. If a system event causes a higher priority thread to become runnable, the scheduler will preempt the currently running thread, even if it holds monitor locks. There are 7 priorities in all, with the default being the middle priority (4). Typically, lower priority is used for long running, background work, while higher priority is used for threads associated with devices or aspects of the user interface, keeping the system responsive for interactive work. A thread's initial priority is set when it is created. It can change its own priority.

The timeslice interval and the CV timeout granularity in the current implementation are each 50 milliseconds. The scheduler runs at least that often, but also runs each time a thread blocks on a mutex, waits on a CV, or calls the YIELD primitive. The only purpose of the YIELD primitive is to cause the scheduler to run (but see discussion later of YieldButNotToMe). The scheduler takes less than 50 microseconds to switch between threads on a Sparcstation-2.

3. Dynamic thread behavior

One of the original motivations for our work was a desire to understand the dynamic behavior of user-level threads. For this purpose, we constructed an instrumented version of PCR that measured the number of threads in the system, thread lifetimes, the run length distribution of threads and the rate at which monitor locks and condition variables are used. Analysis of the data obtained from this instrumented system led to the realization that there were a number of consistent patterns of thread usage, which led to the static analysis on which this paper is focused. To give the reader some background and context for this static analysis, we present a summary of our dynamic data below. This data is based on a set of benchmarks intended to be typical of user activity, including compilation, formatting a document into a page description language (like PostScript), previewing pages described by a page description language and user interface tasks (keyboarding, mousing and scrolling windows). All data was taken on a Sparcstation-2 running SunOS 4.1.3.

Looking at the dynamic thread behavior, we observed several different classes of threads. There were eternal threads that repeatedly waited on a condition variable and then ran briefly before waiting again. There were worker threads that were forked to perform some activity, such as formatting a document. Finally, there were short-lived transient threads that were forked by some long-lived thread, would run for a relatively short while and then exit.

A Cedar or GVX world uses a moderate number of threads. Consider Cedar first: an idle Cedar system has about 35 eternal threads running in it and forks a transient thread once a second on average. Keyboard activity can cause up to 5 thread forks per second, although most other user-interface activity causes much smaller increases in thread forking rates. While one of our benchmark applications (document formatting) employed large numbers of transient threads (forking 3.6 threads/sec.), the other two compute-intensive applications we examined caused thread-forking activity to decrease by more than a factor of 3. In all our benchmarks, the maximum number of threads concurrently existing in the system never exceeded 41, although users employ two to three times this many in everyday work. Transient threads are by far the most numerous, resulting in an average lifetime for non-eternal threads that is well under 1 second.

An idle GVX world exhibits noticeably different behavior than just described. An idle system contains 22 eternal threads and forks no additional threads. In fact, no additional threads are forked for any user interface activity, be it keyboard, mouse, or windowing activity.

Table 1: Forking and thread-switching rates

                          Forks/sec   Thread switches/sec
  Cedar
    Idle Cedar               0.9            132
    Keyboard input           5.0            269
    Mouse movement           1.0            191
    Window scrolling         0.7            172
    Document formatting      3.6            171
    Document previewing      1.6            222
    Make program             0.3            170
    Compile                  0.3            135
  GVX
    Idle GVX                 0               33
    Keyboard input           0               60
    Mouse movement           0               34
    Window scrolling         0               43

The rate at which a Cedar system switches among running threads varies from 130/sec. for an idle system to around 270/sec. for a system experiencing heavy keyboard/mouse input activity. Thread execution intervals (the lengths of time between thread switches) exhibit a peak at about 3 milliseconds, with about 75% of all execution intervals being between 0 and 5 milliseconds in length. This is due to the very short execution intervals of most eternal and transient threads. A second peak is around 45 milliseconds, which is related to the PCR time-slice period, which is 50 milliseconds. Transient and eternal thread activity steals the first part of a timeslice with the remainder going to worker threads.
While most execution intervals are short, longer execution intervals account for most of the total execution time in our systems. Between 20% and 50% of the total execution time during any period is accumulated by threads running for periods of 45 to 50 milliseconds. We also examined the total execution time contribution as a function of thread priority. Only two patterns were evident: of the 7 available priority levels one wasn't used at all, and user interface activity tended to use higher priorities for its threads than did user-initiated tasks such as compiling.

GVX switches among threads at a decidedly lower rate: an idle system switches only 33 times per second, while heavy keyboard activity will drive the rate up to 60/sec. The same bi-modal distribution of execution intervals is exhibited as in Cedar: between 50% and 70% of all execution intervals are between 0 and 5 milliseconds in length, with a second peak around 45 milliseconds. Between 30% and 80% of the total execution time during any period is accumulated by threads running for periods of 45 to 50 milliseconds.

GVX's use of thread priorities was noticeably different than Cedar's. While Cedar's core of long-lived threads are relatively evenly distributed over the four "standard" priority values of 1 to 4, GVX sets almost all of its threads to priority level 3, using the lower two priority levels only for a few background helper tasks. Two of the five low-priority threads in fact never ran during our experiments. As with Cedar, one of the 7 priority levels is never used. However, while Cedar uses level 7 for interrupt handling and doesn't use level 5, GVX does the opposite. In both systems, priority level 6 gets used by the system daemon that does proportional scheduling. Cedar also uses level 6 for its garbage collection daemon.

One interesting behavior that our Cedar thread data exhibited was a variety of different forking patterns. An idle Cedar system forks a transient thread about once every 2 seconds. Each forked thread, in turn, forks another transient thread. Keyboard activity causes a transient thread to be forked by the command-shell thread for every keystroke. On the other hand, simply moving the mouse around causes no threads to be forked. Even clicking a mouse button (e.g. to scroll a window) causes no additional forking activity. (However, both keyboard activity and mouse motion cause significant increases in activity by eternal threads.) Scrolling a text window 10 times causes 3 transient threads to be forked, one of which is the child of one of the other transients.

Document formatting causes a great number of transient threads to be forked by the main formatting worker thread, whereas compilation and document previewing cause a moderate number of transient forks. While the compiler's and previewer's transient threads simply run to completion, each of the document formatter's transient threads forks one or more additional transient threads themselves. However, third generation forked threads do not occur. In fact, none of our benchmarks exhibited forking generations greater than 2. That is, every transient thread was either the child or grandchild of some worker or long-lived thread.

Checking whether a program needs recompiling (the Make program) does not cause any threads to be forked (the command-shell thread gets used as the main worker thread), except for garbage collection and finalization of collected data objects. Each of these activities causes a moderate number of first-generation transient threads to be forked.

Table 2: Wait-CV and monitor entry rates

                          Waits/sec   % timeouts   ML-enters/sec
  Cedar
    Idle Cedar               121          82%            414
    Keyboard input           185          48%           2557
    Mouse movement           163          58%           1025
    Window scrolling         115          69%           2032
    Doc formatting           130          72%           2739
    Doc previewing           157          56%           1335
    Make program             158          61%           2218
    Compile                  119          82%           1365
  GVX
    Idle GVX                  32          99%            366
    Keyboard input            38          42%           1436
    Mouse movement            33          96%            410
    Window scrolling          25          61%            691

The rate at which locking and condition variable primitives are used is another measure of thread activity. Table 2 shows the rates for each benchmark. The rate of waiting on CVs in Cedar ranged from 115/second to 185/second, with 50% to 80% of these waits timing out rather than receiving a wakeup notification. Monitors are entered much more frequently, reflecting their use to protect data structures (especially in reusable library packages). Entry rates varied from 400/second for an idle system to 2500/second for a system experiencing heavy keyboard activity to 2700/second for document formatting. Contention was low, however, occurring on 0.01% to 0.1% of all entries to monitors.

For GVX, the rate of waiting on CVs ranged from 32/second to 38/second, with 42% to 99% of these waits timing out rather than receiving a wakeup notification. Monitors are entered at rates between 366/sec and 1436/sec. Interestingly, contention for monitor locks was sometimes significantly higher in GVX than in Cedar, occurring 0.4% of the time when scrolling a window and 0.2% of the time when heavy keyboard traffic was present.
Table 3: Number of different CVs and monitor locks used

                           # CVs    # MLs
  Cedar
    Idle Cedar               22       554
    Keyboard input           32       918
    Mouse movement           26       734
    Window scrolling         30       797
    Document formatting      46      1060
    Document previewing      32       938
    Make program             24      1296
    Compile                  36      2900
  GVX
    Idle GVX                  5        48
    Keyboard input            7       204
    Mouse movement            5        52
    Window scrolling          6       209

Typically, most of the monitor/condition variable traffic is observed in about 10 to 15 different threads, with the worker thread of a benchmark activity dominating the numbers. The other active threads exhibit approximately equal traffic. The number of different monitors entered during the benchmarks varies from 500 to 3000 as shown in Table 3. In contrast, only about 20 to 50 different condition variables are waited for in the course of the benchmarks.

4. Thread paradigms

Birrell provides a good introduction to some of the basic paradigms for thread use [Birrell91]. Here we go further into more advanced paradigms, their necessity and frequency in practice and the problems of performance and correctness entailed by these advanced usages.

Birrell suggests that forking a new thread is useful in several situations: to exploit concurrency on a multiprocessor (including waiting for I/O, where one processor is the I/O device); to satisfy a human user by making progress on several tasks at once; to provide network service to multiple clients simultaneously; and to defer work to a less busy time [Birrell91, p. 109].

We examined about 650 different code fragments that create threads. We gradually developed a collection of categories that we could use to explain how thread uses were similar to one another and how they differed. Our final list of categories is:

– defer work (same as Birrell's)
– pumps (components of Birrell's pipelines. He describes them for exploiting multiprocessing, but we saw them mostly used for structuring.)
– slack processes, a kind of specialized pump (new)
– sleepers and one-shots (common uses probably omitted by Birrell because synchronization problems in them are rare)
– deadlock avoiders (new)
– task rejuvenation (new)
– serializers, another kind of specialized pump (new)
– concurrency exploiters (same as Birrell's)
– encapsulated forks, which are forks in packages that capture certain paradigms and whose uses are themselves counted in the other categories.

These static categories complement the thread lifetime characterization in Section 3. The eternal threads tend to be sleepers, pumps and serializers with nothing to do. Worker threads are often work deferrers and transient threads are often deadlock avoiders. But the reader is cautioned that the static paradigm can't be predicted from the dynamic lifetime.

4.1 Defer work

Deferring work is the single most common use of forking in these systems. A procedure can often reduce the latency seen by its clients by forking a thread to do work not required for the procedure's return value. Sometimes work can be deferred to times when the system is under less load [Birrell91]. Cedar practice has been to introduce work deferrers freely as the opportunity to use them is noticed. Many commands fork an activity whose results will be reported in a separate window: control in the originating thread returns immediately to the user, an example of latency reduction for the human client. Some examples of work deferrers are:

– forking to print a document
– forking to send a mail message
– forking to create a new window
– forking to update the contents of a window

Some threads are themselves so critical to system responsiveness that they fork to defer almost any work at all beyond noticing what work needs to be done. These critical threads play the role of interrupt handlers. Forking the real work allows it to be done in a lower priority thread and frees the critical thread to respond to the next event. The keyboard-and-mouse watching process, called the Notifier, is such a critical, high priority thread in both Cedar and GVX.
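As a sketch of the paradigm (ours, in Java; the document name and the printing body are invented placeholders), a work deferrer forks the slow part and returns control at once:

    // Work deferrer sketch: return to the caller immediately and do the
    // slow part in a forked, detached thread.
    public class DeferWork {
        static void printDocument(String name) {
            Thread worker = new Thread(() -> {
                // Slow work that is not needed for the procedure's return value.
                System.out.println("formatting and spooling " + name);
            });
            worker.setDaemon(true);   // roughly DETACH: nobody will JOIN it
            worker.start();           // FORK; control returns to the caller now
        }

        public static void main(String[] args) throws InterruptedException {
            printDocument("quarterly-report");
            System.out.println("control is already back with the user");
            Thread.sleep(100);        // give the deferred work time to finish
        }
    }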
4.2 Pumps

Pumps are components of pipelines. They pick up input from one place, possibly transform it in some way and produce it as output someplace else.¹ Bounded buffers and external devices are two common sources and sinks. The former occur in several implementations in our systems for connecting threads together, while the latter are accessed with system calls (read, write) and shared memory (raw screen IO and memory shared with an external X server).

¹ We use the term pump, rather than the more common filter, because data transformation, a la filters, is just one of the things that pumps can do. In general, pumps control both the data transformation and the timing of the transfer. We also like the connotation of an active entity conveyed by pump as opposed to the passivity of filter.

Though Birrell suggests creating pipelines to exploit parallelism on a multiprocessor, we find them most commonly used in our systems as a programming convenience, another reflection of the uniprocessor heritage of the systems. This is also their primary use in the well-known Unix shell pipelines. For example, in our systems all user input is filtered through a pipeline thread that preprocesses events and puts them into another queue, rather than have each reader thread preprocess on demand. Neither approach necessarily provides a more efficient system, but the pipeline is conceptually simpler: tokens just appear in a queue. The programmer needs to understand less about the pieces being connected.

One interesting kind of pump is the slack process. A slack process explicitly adds latency to a pipeline in the hope of reducing the total amount of work done, either by merging input or replacing earlier data with later data before placing it on its output. Slack processes are useful when the downstream consumer of the data incurs high per-transaction costs. The buffer thread discussed in Section 5.2 is an example of a slack process.
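A general pump stage can be sketched in Java, with bounded buffers from java.util.concurrent standing in for the systems' own queues; the upcasing transform is a placeholder:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // A pump: an eternal thread that takes tokens from one bounded buffer,
    // transforms them, and produces them on another.
    public class PumpStage {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> in = new ArrayBlockingQueue<>(16);
            BlockingQueue<String> out = new ArrayBlockingQueue<>(16);

            Thread pump = new Thread(() -> {
                try {
                    while (true) {
                        String token = in.take();      // blocks while upstream is empty
                        out.put(token.toUpperCase());  // blocks while downstream is full
                    }
                } catch (InterruptedException e) {
                    // shut down
                }
            });
            pump.setDaemon(true);
            pump.start();

            in.put("keystroke");
            System.out.println(out.take());  // tokens "just appear" in the queue
        }
    }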
4.3 Sleepers and oneshots

Sleepers are processes that repeatedly wait for a triggering event and then execute. Often the triggering event is a timeout. Examples include: call this procedure in K seconds; blink the cursor in M milliseconds; check for network connection timeout every T seconds. Other common events are external input and service callbacks from other activities. For instance, our systems use callbacks from the garbage collector to finalize objects and callbacks from the filesystem when files change state. These callbacks are removed from time-critical paths in the garbage collector and filesystem by putting an event in a work queue serviced by a sleeper thread. The client's code is then called from the sleeper.

Sleepers frequently do very little work before sleeping again. For instance, various cache managers in our systems simply throw away aged values in a cache then go back to sleep. Another kind of sleeper is the garbage collector's background thread, which cleans pages dirtied by other threads. If it gets too far behind in its work it could cause virtual memory thrashing by cleaning pages no longer resident in physical memory. Its rate of awakening must depend on the amount of page dirtying, which depends on the workload of all the other threads in the system.

OneShots are sleeper processes that sleep for a while, run and then go away. This paradigm is used repeatedly in Cedar, for example, to implement guarded buttons of several kinds. (A guarded button must be pressed twice, in close, but not too close, succession. They usually look like a struck-through "Button" on the screen.) After a one-shot is forked it sleeps for an arming period that must pass before a second click is acceptable. Then it changes the button appearance from the struck-through form to "Button" and sleeps a second time. During this period a second click invokes a procedure associated with the button, but if the timeout expires without a second click, the one-shot just repaints the guarded button.
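A one-shot of the guarded-button kind might look like the following Java sketch; the arming and acceptance intervals, and all names, are invented for illustration:

    // One-shot sketch: forked on the first click, sleeps through an arming
    // period, exposes the button, then expires if no second click arrives.
    public class GuardedButton {
        private boolean clicked = false;

        synchronized void secondClick() {
            clicked = true;
            notify();
        }

        void fireOneShot(Runnable action) {
            new Thread(() -> {
                try {
                    Thread.sleep(200);   // arming period: clicks too soon are ignored
                    System.out.println("button now reads \"Button\"");
                    synchronized (this) {
                        long deadline = System.currentTimeMillis() + 2000;
                        long left;
                        while (!clicked && (left = deadline - System.currentTimeMillis()) > 0) {
                            wait(left);  // timed WAIT, rechecked in a loop
                        }
                        if (clicked) action.run();
                        else System.out.println("timeout: repaint the guard");
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        public static void main(String[] args) throws Exception {
            GuardedButton b = new GuardedButton();
            b.fireOneShot(() -> System.out.println("scroll the window"));
            Thread.sleep(500);
            b.secondClick();             // arrives inside the acceptance window
            Thread.sleep(100);
        }
    }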
4.4 Deadlock avoiders

Cedar often uses FORK to avoid violating lock order constraints. The window manager makes heavy use of this paradigm. For example, after adjusting the boundary between two windows the contents of the windows must be repainted. The boundary-moving thread forks new threads to do the repainting because it already holds some, but not all, of the locks needed for the repainting. Acquiring these locks would require unwinding the adjusting process far enough to release locks that would violate locking order, then reacquiring all the necessary locks in the right order. It is far simpler to fork the painting threads, unwind the adjuster completely and let the painters acquire the locks that they need in separate threads.

Another case of deadlock avoidance is forking the callbacks from a service module to a client module. Forking permits the service thread to proceed, eventually releasing locks it holds that will be needed by the client. The fork also insulates the service from things that may go wrong in the client callback. For instance, Cedar permits clients to register callback procedures with the garbage collector that are called to finalize (clean up) data structures. The finalization service thread forks each callback.
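The repainting example can be sketched as follows (Java; the window locks and repaint bodies are invented). The adjuster never acquires the second lock while holding the first; instead it forks painters that start from a clean slate:

    // Deadlock avoidance by forking: the boundary-moving thread holds one
    // window's lock and must not acquire the other's out of order.
    public class BoundaryMover {
        static final Object windowA = new Object();
        static final Object windowB = new Object();

        static void adjustBoundary() {
            synchronized (windowA) {
                // ... move the boundary; windowB's lock may be held by others ...
                fork(() -> { synchronized (windowA) { repaint("A"); } });
                fork(() -> { synchronized (windowB) { repaint("B"); } });
            } // unwind completely; painters acquire the locks in their own threads
        }

        static void fork(Runnable r) { new Thread(r).start(); }
        static void repaint(String w) { System.out.println("repainting window " + w); }

        public static void main(String[] args) { adjustBoundary(); }
    }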
4.5 Task rejuvenation

Sometimes threads get into bad states, such as arise from uncaught exceptions or stack overflow, from which recovery is impossible within the thread itself. In many cases, however, cleanup and recovery is possible if a new "task rejuvenation" thread is forked. For uncaught errors, an exception handler may simply fork a new copy of the service. For stack overflow, a new thread is forked to report the stack overflow. Using threads for task rejuvenation can be tricky and is a bit counter-intuitive. (This thread is in trouble. OK, let's make two of them!) However, it is a paradigm that adds significantly to the robustness of our systems and its use is growing. A recent addition is a task-rejuvenating FORK that was added to the input event dispatcher in Cedar. The dispatcher makes unforked callbacks to client procedures because (a) this code is on the critical path for user-visible performance and (b) most callbacks are very short (e.g. enqueue an event) and so a fork overhead would be significant. But not forking makes the dispatcher vulnerable to uncaught runtime errors that occur in the callbacks. Using task rejuvenation, the new copy of the dispatcher keeps running.

Task rejuvenation is a controversial paradigm. Its ability to mask underlying design problems suggests that it be used with caution.
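A minimal sketch of task rejuvenation in Java (the failing dispatch loop is a stand-in for a buggy client callback; all names are ours):

    // Task rejuvenation sketch: when the service dies from an uncaught
    // error, its handler forks a new copy of the service.
    public class Rejuvenation {
        static boolean failOnce = true;   // makes the first copy crash, once

        static void startDispatcher() {
            Thread t = new Thread(() -> {
                try {
                    dispatchLoop();
                } catch (Throwable trouble) {
                    System.out.println("dispatcher died (" + trouble + "); forking a new copy");
                    startDispatcher();    // this thread is in trouble: make another one
                }
            });
            t.setDaemon(true);
            t.start();
        }

        static void dispatchLoop() throws InterruptedException {
            if (failOnce) {
                failOnce = false;
                throw new IllegalStateException("error escaping a client callback");
            }
            while (true) Thread.sleep(50);  // the rejuvenated copy keeps running
        }

        public static void main(String[] args) throws InterruptedException {
            startDispatcher();
            Thread.sleep(200);
        }
    }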
4.6 Serializers

A serializer is a queue and a thread that processes the work on the queue. The queue acts as a point of serialization in the system. The primary example is in the window system, where input events can arrive from a number of different sources. They are handled by a single thread in order to preserve their ordering. This same paradigm is present in most other window systems and in many cases it is the only paradigm. In the Macintosh, Microsoft Windows, and X programming models, for example, each application runs in a serializer thread that pulls events from a queue associated with the application's window.
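In Java a serializer is a queue of closures drained by one thread, as in this sketch of ours (MBQueue, described in Section 4.8, has the same skeleton):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // A serializer: events from many threads are enqueued as procedures and
    // executed by a single thread, preserving their arrival order.
    public class EventSerializer {
        private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

        public EventSerializer() {
            Thread worker = new Thread(() -> {
                try {
                    while (true) queue.take().run();   // one event at a time, in order
                } catch (InterruptedException e) {
                    // shut down
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        public void enqueue(Runnable event) { queue.add(event); }

        public static void main(String[] args) throws InterruptedException {
            EventSerializer input = new EventSerializer();
            for (int i = 0; i < 3; i++) {
                int n = i;
                input.enqueue(() -> System.out.println("event " + n));
            }
            Thread.sleep(100);
        }
    }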
4.7 Concurrency exploiters

Concurrency exploiters are threads created specifically to make use of multiple processors. They tend to be very problem-specific in their details. Since our systems have only relatively recently begun to run on multiprocessors we were not surprised to find very few concurrency exploiters in them.

4.8 Encapsulated forks

One way that our systems promote use of common thread paradigms is by providing modules that encapsulate the paradigms. This section describes three such packages used frequently in our systems.

DelayedFork and PeriodicalFork

DelayedFork expresses the paradigm of a one-shot. It calls a procedure at some time in the future. Although one-shots are common in our system, DelayedFork is only used in our window systems. (Its limited use might be because it appeared only recently.)

PeriodicalFork is simply a DelayedFork that repeats over and over again at fixed intervals. It encapsulates the sleeper paradigm where the wakeups are prompted solely by the passage of time.
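The two encapsulations map naturally onto java.util.Timer; this correspondence is our illustration, not the Cedar implementation:

    import java.util.Timer;
    import java.util.TimerTask;

    // DelayedFork: call a procedure once, at some time in the future.
    // PeriodicalFork: the same, repeated over and over at a fixed interval.
    public class EncapsulatedForks {
        static final Timer TIMER = new Timer(true);   // one daemon scheduling thread

        static void delayedFork(Runnable proc, long delayMs) {
            TIMER.schedule(new TimerTask() { public void run() { proc.run(); } },
                           delayMs);
        }

        static void periodicalFork(Runnable proc, long periodMs) {
            TIMER.schedule(new TimerTask() { public void run() { proc.run(); } },
                           periodMs, periodMs);
        }

        public static void main(String[] args) throws InterruptedException {
            delayedFork(() -> System.out.println("one-shot fired"), 100);
            periodicalFork(() -> System.out.println("sleeper woke up"), 150);
            Thread.sleep(500);
        }
    }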
MBQueue

The module MBQueue (the name means Menu/Button Queue) encapsulates the serializer paradigm in our systems. MBQueue creates a queue as a serialization context and a thread to process it. Mouse clicks and key strokes cause procedures to be enqueued for the context: the thread then calls the procedures in the order received.

We consider a queue together with a thread processing it an important building block of user interfaces. This is borne out by the fact that our system actually contains several minor variations of MBQueue. Why is there not a more general package? It seems that each instance adds serialization to a specific interface already familiar to the programmer. Furthermore, the serialization is often on a critical interactive performance path. Keeping a familiar interface to the programmer and reducing latency cause new variations to be preferred over a single generic implementation.

Miscellaneous

Many modules that do callbacks offer a fork boolean parameter in their interface, indicating whether or not the called-back procedure is to be called directly or in a forked thread. The default is almost always TRUE, meaning the callback will be forked. Unforked callbacks are usually intended for experts, because they make future execution of the calling thread within the module dependent on successful completion of the client callback.

4.9 Paradigm summary

Table 4 summarizes the absolute and relative frequencies of the paradigms in our systems. Note that in keeping with our emphasis on using threads as program structuring devices this is a static count. (Threads may be counted in more than one category because they change their behavior.) Some threads seem not to fit easily into any category. These are captured in the "Unknown or other" entries.

Our categories seem to apply well to other systems. For instance, Lampson and Redell describe the applications Pilot, Violet and Gateway (which, although Mesa-based, share no code with the systems we examined), but do not identify general types of threads [Lampson80]. Yet from the published description one can deduce the following:

Pilot: almost all sleepers.
Violet: sleepers, one-shots and work deferral.
Gateway: sleepers and pumps.

Table 4. Static Counts

                              Cedar          GVX
  Defer work               108   31%      77   33%
  Pumps
    General pumps           48   14%      33   14%
    Slack processes          7    2%       2    1%
  Sleepers                  67   19%      15    6%
  Oneshots                  25    7%      11    5%
  Deadlock avoiders         35   10%       6    3%
  Task rejuvenators         11    3%       0    0%
  Serializers                5    1%       7    3%
  Encapsulated forks        14    4%       5    2%
  Concurrency exploiters     3    1%       0    0%
  Unknown or other²         25    7%      78   33%
  TOTAL                    348  100%     234  100%

² The large number of unknown threads in GVX is due to our relative unfamiliarity with this code, rather than reflecting any significant difference in paradigm use.

5. Issues in thread use

Given a modern platform providing threads, system builders have the option of using thread primitives to accomplish tasks that they would have used other techniques to accomplish on other platforms. The designer must balance the modest cost of creating a thread against the benefits in structural simplification and concurrency that would accrue from its introduction. In addition to the cost of creating it, a thread incurs a modest ongoing cost for the virtual memory occupied by its stack. Our PCR thread implementation allocates virtual memory for the maximum possible stack size of each thread. If there is very little state associated with a thread this may be a very inefficient use of memory [Draves91].

5.1 Easy thread uses

Our programmers have become very adept at using the sleeper, oneshot, pump (in paths without critical timing constraints) and work deferrer paradigms. For the most part these require little interaction between threads beyond taking care to provide mutual exclusion (monitors) for shared data. Pump threads interact with threads on either side of them in pipelines, but the interactions generally follow well-known producer-consumer patterns. Using FORK to create sleeper threads has fallen into disfavor with the advent of the PCR thread implementation: 100 kilobytes for each of hundreds of sleepers' stacks is just too expensive. The PeriodicalProcess module, for timeout driven sleepers, and other sleeper encapsulations often can accomplish the same thing using closures to maintain the little bit of state necessary between activations.

Deadlock avoiders also are usually very simple, but the overall locking schemes in which they are involved are often very, very complicated (and far beyond the scope of this paper).

5.2 Hard thread uses

Some of the thread paradigms seem much more difficult. First, there is little guidance in the literature and in our experience for using concurrency exploiters in interactive systems. The arrival of relatively low-cost workstations and operating systems supporting multiprocessing offers new incentives to understand this paradigm better. Second, designing slack processes and other pumps in paths with timing constraints continues to challenge both the slack process implementors and the PCR implementors. Bad performance in the pipelines that provide feedback for typing (character echoing) and mouse motion is immediately apparent to the user, yet improving the performance is often very difficult.

One instance of a poorly performing slack process in a user feedback pipeline involves sending requests to the X server. Good X window system performance requires batching communication with the server and merging overlapping requests. In one of our systems, the batching is performed using the slack process paradigm embodied in a high priority thread. The buffer thread accumulates paint requests, merges overlapping requests and sends them only occasionally to the X server. In the usual producer-consumer style, an imaging thread puts paint requests on a queue for the buffer thread and issues a NOTIFY to wake it up. In order to gather multiple paint requests the buffer thread must not act on the single paint request in its queue when it first wakes up, so it YIELDs to cede the processor to allow more paint requests to be produced by the imaging thread; or so we hope. A problem occurs when the buffer thread is a higher priority thread than the image threads that feed it: the scheduler always chooses the buffer thread to run, not the image thread. Consequently the buffer thread sends the paint request on to the X server and no merging occurs. The result is a high rate of thread and process switching and much more work done by the X server than should be necessary.

Fixing the problem by lowering the priority of the buffer thread is clearly wrong, since that could cause starvation of screen painting. We believe that the architecture of having a producer thread notify a consumer (buffer) thread periodically of work to do is a good one, so we did not consider changing the basic paradigm by which these two threads interact. Rewriting the buffer thread to simply sleep waiting for a timeout before sending its events doesn't work either, for reasons discussed in Section 6.3.
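To make the intended slack behavior concrete before turning to the fix, here is a sketch of a merging buffer thread in Java. It realizes the added latency with a timed drain rather than with priorities and yields, since the PCR primitives discussed here have no Java analog; the 20 millisecond figure and all names are invented:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Slack process sketch: the buffer thread deliberately waits a little
    // after the first paint request so later requests can be merged into
    // one batch before anything is sent downstream.
    public class SlackBuffer {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> paintRequests = new LinkedBlockingQueue<>();

            Thread buffer = new Thread(() -> {
                try {
                    while (true) {
                        StringBuilder batch = new StringBuilder(paintRequests.take());
                        // Added latency ("slack"): gather whatever arrives within 20 ms.
                        String more;
                        while ((more = paintRequests.poll(20, TimeUnit.MILLISECONDS)) != null) {
                            batch.append(" + ").append(more);   // merge overlapping requests
                        }
                        System.out.println("send to X server: " + batch);
                    }
                } catch (InterruptedException e) {
                    // shut down
                }
            });
            buffer.setDaemon(true);
            buffer.start();

            // The imaging thread produces a burst that should go out as one batch.
            for (int i = 1; i <= 3; i++) paintRequests.put("rect" + i);
            Thread.sleep(200);
        }
    }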
We fixed the immediate problem by creating a new yield primitive, called YieldButNotToMe, which gives the processor to the highest priority ready thread other than its caller, if such a thread exists. Most of the time the image thread is the thread favored with the extra cycles and there is a big improvement in the system's perceived performance. Fewer switches are made to the X server, the buffer thread becomes more effective at doing merging, there is less time spent in thread and process switching, and the image thread gets much more processor resource over the same time interval. The result is that the user experiences about a three-fold performance improvement. Unfortunately, the semantic compromise entailed in YieldButNotToMe and the uncertainty that the additional cycles will go to the correct thread suggest that the final solution to these problems still eludes us.

Finally, even beyond the difficulties encountered with priorities in managing a pipeline, we found priorities to be problematic in general. Birrell describes a stable priority inversion in which a high priority thread waits on a lock held by a low priority thread that is prevented from running by a middle-priority cpu hog [Birrell91, pp. 99-100]. Like Birrell, we chose not to incur the implementation overhead of providing priority inheritance from blocked threads to threads holding locks. (Notice that the problem occurs for abstract resources such as the condition associated with a CV as well as for real resources such as locks: the thread implementation has little hope of automatically adjusting thread priority in such situations.) The problem is not hypothetical: we experienced enough real problems with priority inversions that we found it necessary to put the following two workarounds into our systems. First, for one particular kind of lock in the system, PCR does donate cycles from a blocked thread to the thread that is blocking it. This is done only for the per-monitor metalock that locks each monitor's queue of waiting threads. It is not done for monitors themselves, where we don't know how to implement it efficiently. Second, PCR utilizes a high-priority sleeper thread (which we call the SystemDaemon) that regularly wakes up and donates, using a directed yield, a small timeslice to another thread chosen at random. In this way we ensure that all ready threads get some cpu resource, regardless of their priorities.

5.3 Common mistakes

Our dynamic and static inspections of old code revealed occasional correctness and performance problems caused by improper thread usage. Two questionable practices stood out.

First, we saw many instances of WAIT code that did not recheck the predicate associated with the condition variable. Recall that proper use of WAIT when using Mesa monitors is

    WHILE NOT (condition) DO WAIT cv END

not the

    IF NOT (condition) THEN WAIT cv

which would be appropriate with Hoare's original monitors. The IF-based approach will work in Mesa with sufficient constraints on the number and behavior of the threads using the monitor, but its use cannot be recommended. The practice has been a continuing source of bugs as programs are modified and the correctness conditions become untrue.

Second, there were cases where timeouts had been introduced to compensate for missing NOTIFYs (bugs), instead of fixing the underlying problem. The problem with this is that the system can become timeout driven—it apparently works correctly but slowly. Debugging the poor performance is often harder than figuring out why a system has stopped due to a missing NOTIFY. Of course, legitimate timeouts can mask an omitted NOTIFY as well.

5.4 When a fork fails

Sometimes an attempt to fork may fail for lack of resources. As with many other resource allocation failures it's difficult to plan a response. Earlier versions of the systems would raise an error when a FORK failed: the standard programming practice was to catch the error and to try to recover, but good recovery schemes seem never to have been worked out. The resulting code seems overly complex to no good end: the machinery for catching the error is always set up even though once an error is caught nobody really knows what to do about it. Memory allocation failures present similar problems.

Our more recent implementations simply wait in the fork implementation for more resources to become available, but the behaviors seen by the user, such as long delays in response or even complete unresponsiveness, go unexplained.

A number of techniques may be adopted to avoid running out of resources: unfortunately, the better we succeed at that, the less experience we gain with techniques for appropriately recovering from the situation.

5.5 On robustness in a changing environment

The archeology of thread-related bugs taught us two other things worth mentioning as well. First, we found many instances of timeouts and pauses with ridiculous values. These values presumably were chosen with some particular now-obsolete processor speed or network architecture in mind. For user interface-related timeouts, values based on wall-clock time are appropriate, but timeouts related to processor speeds, or more insidiously, to expected network server response times, are more difficult to specify simply for all time. This may be an area of future research. For instance, dynamically tuning application timeout values based on end-to-end system performance may be a workable solution.
Second, we saw several places where the correctness of threaded code depended on strong memory ordering, an assumption no longer true in some modern multiprocessors with weakly ordered memory [Frailong93][Sites93]. The monitor implementation for weak ordering can use memory barrier instructions to ensure that all monitor-protected data access is consistent, but other uses that would be correct with strong ordering will not work. As a simple example, imagine a thread that once a minute constructs a record of time-date values and stores a pointer to that record into a global variable. Under the assumptions of strong ordering and atomic write of the pointer value, this is safe. Under weak ordering, readers of the global variable can follow a pointer to a record that has not yet had its fields filled in. As another example, Birrell offers a performance hint for calling an initialization routine exactly once [Birrell91, p. 97]. Under weak ordering, a thread can both believe that the initializer has already been called and not yet be able to see the initialized data.
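The time-date example translates directly into Java, where the memory model makes both the hazard and the repair explicit; the record layout is invented:

    // Unsafe publication under weak ordering, and the Java repair.
    public class Publication {
        static class TimeDate { long seconds; int day; }

        // Broken on weakly ordered machines: the pointer store may become
        // visible to readers before the stores that filled in the fields.
        static TimeDate current;

        // Correct: volatile orders the field stores before the publishing
        // store, the moral equivalent of the memory barriers in the text.
        static volatile TimeDate currentSafe;

        public static void main(String[] args) {
            TimeDate t = new TimeDate();
            t.seconds = System.currentTimeMillis() / 1000;
            t.day = 1;
            current = t;       // racy publish: a reader may see a half-built record
            currentSafe = t;   // safe publish: readers see fully initialized fields
            System.out.println("published " + currentSafe.seconds);
        }
    }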
5.6 Some contrasting experiences: multi-threading and X windows

The usual, single-threaded client interface to an X server is through Xlib, a library that translates an X client's procedural interface into the message-oriented interface of the X server. Xlib does the required translation and manages the I/O connection to the server, both for reading events and writing commands. We studied two approaches to using X windows from a multi-threaded client. One approach uses Xlib, modified only to make it thread-safe [Schmidtmann93]. The other approach uses Xl, an X client library designed from scratch with multi-threading in mind [Jacobi92].

A major difference between the two approaches concerns management of the I/O connection to the server. Xl introduced a new serializing thread that was associated with the I/O connection. The job of this thread was solely to read from the I/O connection and dispatch events to waiting threads. In contrast, the modified Xlib allowed any client thread to do the read, with a monitor lock on the library providing serialization. There were two problems with this: priority inversion and honoring the clients' timeout parameter on the GetEvent routine. When a client thread blocks on the read call it holds the library mutex. A priority inversion could occur if the thread were preempted. Furthermore, it is not possible for other threads to timeout on their attempt to obtain the library mutex. Therefore, each read had to be done with a short timeout after which the mutex was released, allowing other threads to continue. In contrast, with the introduction of a reading thread, the client timeout is handled perfectly by the condition variable timeout mechanism and priority inversion can only occur during the short time period when a low-priority thread checks to see if there are events on the input queue.

A second benefit of introducing this thread concerns interaction between output requests and input events. The X specification requires that the output queue be flushed whenever a read is done on the input stream. This ensures that any commands that might trigger a response are delivered to the server before the client waits on the reply. The modified Xlib retained this behavior, but the short timeout on the read operations (to handle the problem described above) caused an excessive number of output flushes, defeating the throughput gains of batching requests. With the introduction of a reading thread, however, there is no need to couple the input and output together. The reading thread can block indefinitely, and other mechanisms such as an explicit flush by clients or a periodic timeout by a maintenance thread ensure that output gets flushed in a timely manner.

Another difference in the two approaches is the introduction of an extra thread for the batching of graphics requests. Both systems do batching on a higher level to eliminate unnecessary X requests. Xl uses the slack process mentioned previously. It makes the connection to the server asynchronous in order to improve throughput, especially when performing many graphics operations. The modified Xlib uses external knowledge of when the painting is finished to trigger a flush of the batched requests. This limits asynchronous graphics operations and leads to a few superfluous flushes.

In summary, this example illustrates the benefit of introducing an additional thread to help manage concurrency and interactions with external I/O events.

6. Issues in thread implementation

6.1 Spurious lock conflicts

A spurious lock conflict occurs between a thread notifying a CV and the thread that it awakens. Birrell describes its occurrence on a multiprocessor: the scheduler starts to run the notified thread on another processor while the notifying thread, still running on its processor, holds the associated monitor lock. The notifyee runs for a few microseconds and then blocks waiting for the monitor lock. The cost of this behavior is that of useless trips through the scheduler made by the notifyee's processor [Birrell91].
We observed this phenomenon even on a uniprocessor, where it occurs when the waiting thread has higher priority than the notifying thread. Since the Mesa language does not allow condition variable notifies outside of monitor locks, Birrell's technique of moving the NOTIFY out of the locked region was not applicable. In our systems the fix (defer processor rescheduling, but not the notification itself, until after monitor exit) was made in the runtime implementation. The changed implementation of NOTIFY prevents the problem both in the case of interpriority notifications and on multiprocessors.

6.2 Priorities

PCR approximates a strict priority scheduler, by which we mean that if a process of a given priority is currently scheduled to run on a processor, no process with higher priority is ready to run. As we have seen in Section 5.2, strict priority is not a desirable model on which to run our client code: a model that provides some cpu resource to all runnable threads has proven necessary to overcome stable priority inversions. Similarly, the YieldButNotToMe and SystemDaemon hacks, also described above, violate strict priority semantics yet have proven useful in making our system perform well. The SystemDaemon hack pushes the thread model a bit in the direction of proportional fair-share scheduling (threads at each priority progress at a rate proportional to a function of the current distribution of threads among priorities), a model intuitively better suited to controlling long-term average behavior than to controlling moment-by-moment processor allocation to meet near-real-time requirements.

We do not regard this as a satisfactory state of affairs. These implementation hacks mean that the thread model is incompletely specified with respect to priorities, adversely affecting our ability to reason about existing code and to provide guidance for engineering new code. Priority inversions and techniques for avoiding them are the subjects of considerable research in the realtime computing context [Sha90][Pilling91]. We believe that someone should investigate these techniques for interactive systems and report on the result.

6.3 The effect of the time-slice quantum

Only after several months of study of the individual thread switching events of the X server slack process did we realize the importance of the time-slice quantum. The end of a timeslice ends the effect of a YieldButNotToMe or a directed yield. What we did not realize for a long time is that it is the 50 millisecond quantum that is clocking the sending of the X requests from the buffer thread. That is, the only reason this performs well is that the quantum is 50 milliseconds. For instance, if the quantum were 1 second, then X events would be buffered for one second before being sent and the user would observe very bursty screen painting. If the quantum were 1 millisecond, then the YieldButNotToMe would yield only very briefly and we would be back to the start of our problems again.

Above we mentioned that it does not work to rewrite the buffer thread to sleep for a timed interval, instead of doing a yield. The reason is that the smallest sleep interval is the remainder of the scheduler quantum. Our 50 millisecond quantum is a little bit too long for snappy keyboard echoing and line drawing, both instances where immediate response is more important than the throughput improvement achieved by buffering. However, if the scheduler quantum were 20 milliseconds, using a timeout instead of a yield in the buffer thread would work fine.

We conclude that the choice of scheduler quantum is not to be taken lightly in the design of interactive thread systems, since it can severely affect the performance of different, correct multiprogramming algorithms.

7. Conclusions

We have analyzed two interactive computing systems that make heavy use of light-weight threads and have been in daily use for many years by many people. The model of threads both these systems use is one based on preemptable threads that use monitors and condition variables to control thread interactions. Both systems run on top of the Portable Common Runtime, which provides preemptable user-level threads based on a mostly priority-based scheduling model.

Our analysis has focused on how people use threads for program structuring rather than for achieving multiprocessor performance. As such, we have focused more on a static analysis of program code, coupled with an analysis of thread micro-behavior, than on macroscopic thread runtime statistics.

These systems exemplify some paradigms that may be useful to the thread programmer who is ready for more advanced thread uses, such as slack processes, serializers, deadlock avoiders and task rejuvenators. These systems also show that even very experienced communities may struggle at times to use threads well. Some thread paradigms using monitor mechanisms are easy for programmers to use; others, such as slack processes and priorities, challenge both application programmers and thread system implementors.

One of our major conclusions is a suggestion for future work in this area: there is still much to be learned from a careful analysis of large systems. The more unusual paradigms described in this paper, such as task rejuvenation and deadlock avoidance, arose from a relatively small community over a small number of years. There are likely other innovative uses of threads waiting to be discovered.
Another area of future work is to explore the work from the real-time scheduling community in the context of large, interactive systems. Both strict priority scheduling and fair-share priority scheduling seem to complicate rather than ease the programming of highly reactive systems.

Finally, reading code and microscopic analysis taught us new things about systems we had created and used over a ten year period. Even after a year of looking at the same 100 millisecond event histories we are seeing new things in them. To understand systems it is not enough to describe how things should be; one also needs to know how they are.

Bibliography

[Accetta86] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, M. Young. "Mach: A New Kernel Foundation for UNIX Development." Proceedings of the Summer 1986 USENIX Conference, July 1986.

[Bier92] E. Bier. "EmbeddedButtons: Supporting Buttons in Documents." ACM Transactions on Information Systems, 10(4), October 1992, pp. 381-407.

[Birrell91] A. Birrell. "An Introduction to Programming with Threads." In Systems Programming with Modula-3, G. Nelson, editor. Prentice Hall, 1991, pp. 88-118.

[Custer93] H. Custer. Inside Windows NT. Microsoft Press, 1993.

[Draves91] R. Draves, B. Bershad, R. Rashid, R. Dean. "Using Continuations to Implement Thread Management and Communication in Operating Systems." Proceedings of the 13th ACM Symposium on Operating Systems Principles, in Operating Systems Review, 25(5), October 1991.

[Frailong93] J. Frailong, M. Cekleov, P. Sindhu, J. Gastinel, M. Splain, J. Price, A. Singhal. "The Next-Generation SPARC Multiprocessing System Architecture." Proceedings of COMPCON93.

[Halstead90] R. Halstead, Jr., D. Kranz. "A Replay Mechanism for Mostly Functional Parallel Programs." DEC Cambridge Research Lab Technical Report 90/6, November 13, 1990.

[Hoare74] C.A.R. Hoare. "Monitors: An Operating System Structuring Concept." Communications of the ACM, 17(10), October 1974.

[Jacobi92] C. Jacobi. "Migrating Widgets." Proceedings of the 6th Annual X Technical Conference, in The X Resource, Issue 1, January 1992, p. 157.

[Lampson80] B. Lampson, D. Redell. "Experience with Processes and Monitors in Mesa." Communications of the ACM, 23(2), February 1980.

[Owicki89] S. Owicki. "Experience with the Firefly Multiprocessor Workstation." Research Report 51, Digital Equipment Corp. Systems Research Center, September 1989.

[Pier88] K. Pier, E. Bier, M. Stone. "An Introduction to Gargoyle: an Interactive Illustration Tool." J.C. van Vliet (editor), Document Manipulation and Typography, Proceedings of the Int'l Conference on Electronic Publishing, Document Manipulation and Typography, Nice (France), April 20-22, 1988. Cambridge University Press, 1988, pp. 223-238.

[Pilling91] M. Pilling. "Dangers of Priority as a Structuring Principle for Real-Time Languages." Australian Computer Science Communications, 13(1), February 1991, pp. 18-1 - 18-10.

[Powell91] M. Powell, S. Kleiman, S. Barton, D. Shah, D. Stein, M. Weeks. "SunOS Multi-thread Architecture." Proceedings of the Winter 1991 USENIX Conference, January 1991, pp. 65-80.

[Scheifler92] R. Scheifler, J. Gettys. X Window System: The Complete Reference to Xlib, X Protocol, ICCCM, XLFD. Third edition. Digital Press, 1992.

[Schmidtmann93] C. Schmidtmann, M. Tao, S. Watt. "Design and Implementation of a Multi-Threaded Xlib." Proceedings of the Winter 1993 USENIX Conference, January 1993, pp. 193-204.

[Sha90] L. Sha, J. Goodenough. "Real-Time Scheduling Theory and Ada." IEEE Computer, 23(4), April 1990, pp. 53-62.

[Sites93] R. Sites. "Alpha AXP Architecture." Communications of the ACM, 36(2), February 1993, pp. 33-44.

[Smith82] D. Smith, C. Irby, R. Kimball, B. Verplank, E. Harslem. "Designing the STAR User Interface." BYTE Magazine, 7(4), April 1982, pp. 242-282.

[Swinehart86] D. Swinehart, P. Zellweger, R. Beach, R. Hagmann. "A Structural View of the Cedar Programming Environment." ACM Transactions on Programming Languages and Systems, 8(4), October 1986.

[Weiser89] M. Weiser, A. Demers, C. Hauser. "The Portable Common Runtime Approach to Interoperability." Proceedings of the 12th ACM Symposium on Operating Systems Principles, in Operating Systems Review, 23(5), December 1989.
