THEO HAERDER
Fachbereich Informatik, University of Kaiserslautern, West Germany
ANDREAS REUTER
IBM Research Laboratory, San Jose, California 95193
Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its
date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To
copy otherwise, or to republish, requires a fee and/or specific permission.
© 1983 ACM 0360-0300/83/1200-0287 $00.75
FUNDS-TRANSFER: PROCEDURE;
  $BEGIN-TRANSACTION;
  ON ERROR DO;                      /* in case of error */
    $RESTORE-TRANSACTION;           /* undo all work */
    GET INPUT MESSAGE;              /* reacquire input */
    PUT MESSAGE (TRANSFER FAILED);  /* report failure */
    GO TO COMMIT;
  END;
  GET INPUT MESSAGE;                /* get and parse input */
  EXTRACT ACCOUNT-DEBIT, ACCOUNT-CREDIT,
    AMOUNT FROM MESSAGE;
  $UPDATE ACCOUNTS                  /* do debit */
    SET BALANCE = BALANCE - AMOUNT
    WHERE ACCOUNTS-NUMBER = ACCOUNT-DEBIT;
  $UPDATE ACCOUNTS                  /* do credit */
    SET BALANCE = BALANCE + AMOUNT
    WHERE ACCOUNTS-NUMBER = ACCOUNT-CREDIT;
  $INSERT INTO HISTORY              /* keep audit trail */
    (DATE, MESSAGE);
  PUT MESSAGE (TRANSFER DONE);      /* report success */
COMMIT:                             /* commit updates */
  $COMMIT-TRANSACTION;
END;                                /* end of program */
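In modern terms, the all-or-nothing behavior of the Figure 1 program can be sketched with an embedded SQL interface such as Python's sqlite3; the schema and account numbers below are invented for illustration:

```python
import sqlite3

# Hypothetical schema: an accounts table with account number and balance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (number INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def funds_transfer(conn, debit, credit, amount):
    """All-or-nothing transfer: both UPDATEs commit, or neither does."""
    try:
        with conn:  # BEGIN ... COMMIT; rolls back automatically on an exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE number = ?",
                         (amount, debit))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE number = ?",
                         (amount, credit))
        return "TRANSFER DONE"
    except sqlite3.Error:
        # Corresponds to the ON ERROR clause: all work is undone.
        return "TRANSFER FAILED"
```

The `with conn:` block plays the role of the $BEGIN-TRANSACTION/$COMMIT-TRANSACTION bracket: an exception inside it triggers an automatic rollback, just as the ERROR clause invokes $RESTORE-TRANSACTION.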
These situations and dependencies have been investigated thoroughly by Bjork and Davies in their studies of the so-called spheres of control [Bjork 1973; Davies 1973, 1978]. They indicate that data being operated by a process must be isolated in some way that lets others know the degree of reliability provided for these data, that is,

• Will the data be changed without notification to others?
• Will others be informed about changes?
• Will the value definitely not change any more?

This ambitious concept was restricted to use in database systems by Eswaran et al. [1976] and given its current name, the transaction. The transaction basically reflects the idea that the activities of a particular user are isolated from all concurrent activities, but restricts the degree of isolation and the length of a transaction. Typically, a transaction is a short sequence of interactions with the database, using operators such as FIND a record or MODIFY an item, which represents one meaningful activity in the user's environment. The standard example that is generally used to explain the idea is the transfer of money from one account to another. The corresponding transaction program is given in Figure 1.

The concept of a transaction, which includes all database interactions between $BEGIN-TRANSACTION and $COMMIT-TRANSACTION in the above example, requires that all of its actions be executed indivisibly: Either all actions are properly reflected in the database or nothing has happened. No changes are reflected in the database if at any point in time before reaching the $COMMIT-TRANSACTION the user enters the ERROR clause containing the $RESTORE-TRANSACTION. To achieve this kind of indivisibility, a transaction must have four properties:

Atomicity. It must be of the all-or-nothing type described above, and the user must, whatever happens, know which state he or she is in.

Consistency. A transaction reaching its normal end (EOT, end of transaction), thereby committing its results, preserves the consistency of the database. In other words, each successful transaction by definition commits only legal results. This con-
interface. At the bottom, the database consists of some billions of bits stored on disk, which are interpreted by the DBMS into meaningful information on which the user can operate. With each level of abstraction (proceeding from the bottom up), the objects become more complex, allowing more powerful operations and being constrained by a larger number of integrity rules. The uppermost interface supports one of the well-known data models, whether relational, networklike, or hierarchical.

Note that this mapping hierarchy is virtually contained in each DBMS, although for performance reasons it will hardly be reflected in the module structure. We shall briefly sketch the characteristics of each layer, with enough detail to establish our taxonomy. For a more complete description see Haerder and Reuter [1983].

File Management. The lowest layer operates directly on the bit patterns stored on some nonvolatile, direct access device like a disk, drum, or even magnetic bubble memory. This layer copes with the physical characteristics of each storage type and abstracts these characteristics into fixed-length blocks. These blocks can be read, written, and identified by a (relative) block number. This kind of abstraction is usually done by the data management system (DMS) of a normal general-purpose operating system.

Propagation Control.² This level is not usually considered separately in the current database literature, but for reasons that will become clear in the following sections we strictly distinguish between pages and blocks. A page is a fixed-length partition of a linear address space and is mapped into a physical block by the propagation control layer. Therefore a page can be stored in different blocks during its lifetime in the database, depending on the strategy implemented for propagation control.

² This term is introduced in Section 2.4; its meaning is not essential to the understanding of this paragraph.

Access Path Management. This layer implements mapping functions much more complicated than those performed by subordinate layers. It has to maintain all physical object representations in the database (records, fields, etc.), and their related access paths (pointers, hash tables, search trees, etc.) in a potentially unlimited linear virtual address space. This address space, which is divided into fixed-length pages, is provided by the upper interface of the supporting layer. For performance reasons, the partitioning of data into pages is still visible on this level.

Navigational Access Layer. At the top of this layer we find the operations and objects that are typical for a procedural data manipulation language (DML). Occurrences of record types and members of sets are handled by statements like STORE, MODIFY, FIND NEXT, and CONNECT [CODASYL 1978]. At this interface, the user navigates one record at a time through a hierarchy, through a network, or along logical access paths.

Nonprocedural Access Layer. This level provides a nonprocedural interface to the database. With each operation the user can handle sets of results rather than single records. A relational model with high-level query languages like SQL or QUEL is a convenient example of the abstraction achieved by the top layer [Chamberlin 1980; Stonebraker et al. 1976].

On each level, the mapping of higher objects to more elementary ones requires additional data structures, some of which are shown in Table 1.

2.2 The Storage Hierarchy: Implementational Environment

Both the amount of redundant data required to support the recovery actions described in Section 1 and the methods of collecting such data are strongly influenced by various properties of the different storage media used by the DBMS. In particular, the dependencies between volatile and permanent storage have a strong impact on algorithms for gathering redundant information and implementing recovery measures [Chen 1978]. As a descriptional framework we shall use a storage hierarchy, as shown in Figure 4. It closely resembles the situation that must be dealt with by most of today's commercial database systems.

[Figure 4. The storage hierarchy: host computer; temporary log, which supports transaction UNDO, global UNDO, and partial REDO; physical copy of the database; archive copy of the database.]

The host computer, where the application programs and DBMS are located, has a main memory, which is usually volatile.³ Hence we assume that the contents of the database buffer, as well as the contents of the output buffers to the log files, are lost whenever the DBMS terminates abnormally. Below the volatile main memory there is a two-level hierarchy of permanent copies of the database. One level contains an on-line version of the database in direct access memory; the other contains an archive copy as a provision against loss of the on-line copy. While both are functionally situated on the same level, the on-line copy is almost always up-to-date, whereas the archive copy can contain an old state of the database. Our main concern here is database recovery, which, like all provisions for

³ In some real-time applications main memory is supported by a battery backup. It is possible that in the future mainframes will have some stable buffer storage. However, we are not considering these conditions here.
fault tolerance, is based upon redundancy. We have mentioned one type of redundancy: the archive copy, kept as a starting point for reconstruction of an up-to-date on-line version of the database (global REDO). This is discussed in more detail in Section 4. To support this, and other recovery actions introduced in Section 1, two types of log files are required:

Temporary Log. The information collected in this file supports crash recovery; that is, it contains information needed to reconstruct the most recent database (DB) buffer. Selective transaction UNDO requires random access to the log records. Therefore we assume that the temporary log is located on disk.

Archive Log. This file supports global REDO after a media failure. It depends on the availability of the archive copy and must contain all changes committed to the database after the state reflected in the archive copy. Since the archive log is always processed in sequential order, we assume that the archive log is written on magnetic tape.

2.3 Different Views of a Database

In Section 2.1, we indicated that the database looks different at each level of abstraction, with each level using different objects and interfaces. But this is not what we mean by different views of a database in this section. We have observed that the process of abstraction really begins at Level 3, up to which there is only a more convenient representation of data in external storage. At this level, abstraction is dependent on which pages actually establish the linear address space, that is, which block is read when a certain page is referenced. In the event of a failure, there are different possibilities for retrieving the contents of a page. These possibilities are denoted by different views of the database:

The current database comprises all objects accessible to the DBMS during normal processing. The current contents of all pages can be found on disk, except for those pages that have been recently modified. Their new contents are found in the DB buffer. The mapping hierarchy is completely correct.

The materialized database is the state that the DBMS finds at restart after a crash without having applied any log information. There is no buffer. Hence some page modifications (even of successful transactions) may not be reflected in the on-line copy. It is also possible that a new state of a page has been written to disk, but the control structure that maps pages to blocks has not yet been updated. In this case, a reference to such a page will yield the old value. This view of the database is what the recovery system has to transform into the most recent logically consistent current database.

The physical database is composed of all blocks of the on-line copy containing page images, current or obsolete. Depending on the strategy used on Level 2, there may be different values for one page in the physical database, none of which are necessarily the current contents. This view is not normally used by recovery procedures, but a salvation program would try to exploit all information contained therein.

With these views of a database, we can distinguish three types of update operations, all of which explain the mapping function provided by the propagation control level. First, we have the modification of page contents caused by some higher level module. This operation takes place in the DB buffer and therefore affects only the current database. Second, there is the write operation, transferring a modified page to a block on disk. In general, this affects only the physical database. If the information about the block containing the new page value is stored in volatile memory, the new contents will not be accessible after a crash; that is, it is not yet part of the materialized database. The operation that makes a previously written page image part of the materialized database is called propagation. This operation writes the updated control structures for mapping pages to blocks in a safe, nonvolatile place, so that they are available after a crash.

If pages are always written to the same block (the so-called update-in-place operation, which is done in most commercial DBMSs), writing is implicitly the equivalent
[Figure 6. Current versus materialized database in ¬ATOMIC (a) and ATOMIC (b and c) propagation. The panels show page sets {A, B, C, D} in the system buffer, the current database, and the materialized database.]
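A toy model may help to fix the three notions the figure contrasts. The class below is our own simplification (a single durable page table, no log); it is not how any particular DBMS is implemented:

```python
class Storage:
    """Toy model: buffer = current DB, blocks = physical DB,
    durable page table = materialized DB."""
    def __init__(self):
        self.buffer = {}        # DB buffer (volatile): current database
        self.blocks = {}        # disk blocks: physical database
        self.page_table = {}    # durable page->block map: materialized database
        self.next_block = 0

    def modify(self, page, contents):
        """First operation: change page contents in the DB buffer only."""
        self.buffer[page] = contents

    def write(self, page):
        """Second operation: transfer the page to a (new) block on disk."""
        block, self.next_block = self.next_block, self.next_block + 1
        self.blocks[block] = self.buffer[page]
        return block

    def propagate(self, page, block):
        """Third operation: durably record the page->block mapping,
        making the written image part of the materialized database."""
        self.page_table[page] = block

    def materialized(self, page):
        """What restart sees after a crash: no buffer, only mapped blocks."""
        return self.blocks.get(self.page_table.get(page))

s = Storage()
s.modify("A", "new value")   # affects only the current database
blk = s.write("A")           # affects only the physical database
s.propagate("A", blk)        # the new image now survives a crash
```

Until `propagate` runs, `materialized("A")` returns nothing even though the new image already sits in a disk block, which is exactly the gap between the physical and the materialized database described above.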
mechanism [Lorie 1977]. The mapping of page numbers to block numbers is done by using page tables. These tables have one entry per page containing the block number where the page contents are stored. The shadow pages, accessed via the shadow page table V', preserve the old state of the materialized database. The current version is defined by the current page table V. Before this state is made stable (propagated), all changed pages are written to their new blocks, and so is the current page table. If this fails, the database will come up in its old state. When all pages related to the new state have been written, ATOMIC propagation takes place by changing one record on disk (which now points to V rather than V') in a way that cannot be confused by a system crash. Thus the problem of indivisibly propagating a set of pages has been reduced to safely updating one record, which can be done in a simple way. For details, see Lorie [1977].

There are other implementations for ATOMIC propagation. One is based on maintaining two recent versions of a page. For each page access, both versions have to be read into the buffer. This can be done with minimal overhead by storing them in adjacent disk blocks and reading them with chained I/O. The latest version, recognized by a time stamp, is kept in the buffer; the other one is immediately discarded. A modified page replaces the older version on disk. ATOMIC propagation is accomplished by incrementing a special counter that is related to the time stamps in the pages. Details can be found in Reuter [1980]. Another approach to ATOMIC propagation has been introduced under the name differential files by Severance and Lohman [1976]. Modified pages are written to a separate (differential) file. Propagating these updates to the main database is not ATOMIC in itself, but once all modifications are written to the differential file, propagation can be repeated as often as wished. In other words, the process of copying modified pages into the materialized database can be made to appear ATOMIC. A variant of this technique, the intention list, is described by Lampson and Sturgis [1979] and Sturgis et al. [1980].

Thus far we have shown that arbitrary sets of pages can be propagated in an ATOMIC manner using indirect page allocation. In the next section we discuss how these sets of pages for propagation should be defined.

3. CRASH RECOVERY

In order to illustrate the consequences of the concepts introduced thus far, we shall present a detailed discussion of crash recovery. First, we consider the state in which a database is left when the system terminates abnormally. From this we derive the type of redundant (log) information required to reestablish a transaction-consistent state, which is the overall purpose of DB recovery. After completing our classification scheme, we give examples of recovery techniques in currently available database systems. Finally, we present a table containing a qualitative evaluation of all instances encompassed by our taxonomy (Table 4).

Note that the results in this section also apply to transaction UNDO, a much simpler case of global UNDO, which applies when the DBMS is processing normally and no information is lost.

3.1 State of the Database after a Crash

After a crash, the DBMS has to restart by applying all the necessary recovery actions described in Section 1. The DB buffer is lost, as is the current database, the only view of the database to contain the most recent state of processing. Assuming that the on-line copy of the database is intact, there are the materialized database and the temporary log file from which to start recovery. We have not discussed the contents of the log files for the reason that the type and amount of log data to be written during normal processing are dependent upon the state of the materialized database after a crash. This state, in turn, depends upon which method of page allocation and propagation is used.

In the case of direct page allocation and ¬ATOMIC propagation, each write operation affects the materialized database. The decision to write pages is made by the buffer manager according to buffer capacity at points in time that appear arbitrary. Hence the state of the materialized database after a crash is unpredictable: When recent modifications are reflected in the materialized database, it is not possible (without further provisions) to know which pages were modified by complete transactions (whose contents must be reconstructed by partial REDO) and which pages were modified by incomplete transactions (whose contents must be returned to their previous state by global UNDO). Further possibilities for providing against this situation are briefly discussed in Section 3.2.1.

In the case of indirect page allocation and ATOMIC propagation, we know much more about the state of the materialized database after a crash. ATOMIC propagation is indivisible by any type of failure, and therefore we find the materialized database to be exactly in the state produced by the most recent successful propagation. This state may still be inconsistent in that not all updates of complete transactions are visible, and some effects of incomplete transactions are. However, ATOMIC propagation ensures that a set of related pages is propagated in a safe manner by restricting propagation to points in time when the current database fulfills certain consistency constraints. When these constraints are satisfied, the updates can be mapped to the materialized database all at once. Since the current database is consistent in terms of the access path management level (where propagation occurs), this also ensures that all internal pointers, tree structures, tables, etc. are correct. Later on, we also discuss schemes that allow for transaction-consistent propagation.

Computing Surveys, Vol. 15, No. 4, December 1983
The state of the materialized database after a crash can be summarized as follows:

¬ATOMIC Propagation. Nothing is known about the state of the materialized database; it must be characterized as chaotic.

ATOMIC Propagation. The materialized database is in the state produced by the most recent propagation. Since this is bound by certain consistency constraints, the materialized database will be consistent (but not necessarily up-to-date) at least up to the third level of the mapping hierarchy.

In the case of ¬ATOMIC propagation, one cannot expect to read valid images for all pages from the materialized database after a crash; it is inconsistent on the propagation level, and all abstractions on higher levels will fail. In the case of ATOMIC propagation, the materialized database is consistent at least on Level 3, thus allowing for the execution of operations on Level 4 (DML statements).

3.2 Types of Log Information to Support Recovery Actions

The temporary log file must contain all the information required to transform the materialized database as found into the most recent transaction-consistent state (see Section 1). As we have shown, the materialized database can be in more or less defined states, may or may not fulfill consistency constraints, etc. Hence the amount of log data will be determined by what is contained in the materialized database at the beginning of restart. We can be fairly certain of the contents of the materialized database in the case of ATOMIC propagation, but the results of ¬ATOMIC schemes have been shown to be unpredictable. There are, however, additional measures to somewhat reduce the degree of uncertainty resulting from ¬ATOMIC propagation, as discussed in the following section.

3.2.1 Dependencies between Buffer Manager and Recovery Component

3.2.1.1 Buffer Management and UNDO Recovery Actions. During the normal mode of operation, modified pages are written to disk by some replacement algorithm managing the database buffer. Ideally, this happens at points in time that are determined solely by buffer occupation and, from a consistency perspective, seem to be arbitrary. In general, even dirty data, that is, pages modified by incomplete transactions, may be written to the physical database. Hence the UNDO operations described earlier will have to recover the contents of both the materialized database and the external storage media. The only way to avoid this requires that the buffer manager be modified to prevent it from writing or propagating dirty pages under all circumstances. In this case, UNDO could be considerably simplified:

• If no dirty pages are propagated, global UNDO becomes virtually unnecessary; that is, if there are no dirty data in the materialized database.

• If no dirty pages are written, transaction UNDO can be limited to main storage (buffer) operations.

The major disadvantage of this idea is that very large database buffers would be required (e.g., for long batch update transactions), making it generally incompatible with existing systems. However, the two different methods of handling modified pages introduced with this idea have important implications for UNDO recovery. We shall refer to these methods as:

STEAL. Modified pages may be written and/or propagated at any time.

¬STEAL. Modified pages are kept in buffer at least until the end of the transaction (EOT).

The definition of STEAL can be based either on writing or propagating, which are not discriminated in ¬ATOMIC schemes. In the case of ATOMIC propagation both variants of STEAL are conceivable, and each would have a different impact on UNDO recovery actions; in the case of ¬STEAL, no logging is required for UNDO purposes.

3.2.1.2 Buffer Management and REDO Recovery Actions. As soon as a transaction commits, all of its results must survive any subsequent failure (durability). Committed updates that have not been propagated to
the materialized database would definitely be lost in case of a system crash, and so there must be enough redundant information in the log file to reconstruct these results during restart (partial REDO). It is conceivable, however, to avoid this kind of recovery by the following technique.

During Phase 1 of EOT processing all pages modified by this transaction are propagated to the materialized database; that is, their writing and propagation are enforced. Then we can be sure that either the transaction is complete, which means that all of its results are safely recorded (no partial REDO), or, in case of a crash, some updates are not yet written, which means that the transaction is not successful and must be rolled back (UNDO recovery actions).

Thus we have another criterion concerning buffer handling, which is related to the necessity of REDO recovery during restart:

FORCE. All modified pages are written and propagated during EOT processing.

¬FORCE. No propagation is triggered during EOT processing.

The implications with regard to the gathering of log data are quite straightforward: in the case of FORCE, no logging is required for partial REDO; in the case of ¬FORCE such information is required. While FORCE avoids partial REDO, there must still be some REDO-log information for global REDO to provide against loss of the on-line copy of the database.

3.2.2 Classification of Log Data

Depending on which of the write and propagation schemes introduced above are being implemented, we will have to collect log information for the purpose of

• removing invalid data (modifications effected by incomplete transactions) from the materialized database and

• supplementing the materialized database with updates of complete transactions that were not contained in it at the time of crash.

In this section, we briefly describe what such log data can look like and when such data are applicable to the crash state of the materialized database.

Log data are redundant information, collected for the sole purpose of recovery from a crash or a media failure. They do not undergo the mapping process of the database objects, but are obtained on a certain level of the mapping hierarchy and written directly to nonvolatile storage, that is, the log files. There are two different, albeit not fully orthogonal, criteria for classifying log data. The first is concerned with the type of objects to be logged. If some part of the physical representation, that is, the bit pattern, is written to the log, we refer to it as physical logging; if the operators and their arguments are recorded on a higher level, this is called logical logging. The second criterion concerns whether the state of the database (before or after a change) or the transition causing the change is to be logged. Table 2 contains some examples for these different types of logging, which are explained below.

[Table 2. Classification Scheme for Log Data.]

Physical State Logging on Page Level. The most basic method, which is still applied in many commercial DBMSs, uses the page as the unit of log information. Each time a part of the linear address space is changed by some modification, insertion, etc., the whole page containing this part of the linear address space is written to the log. If UNDO logging is required, this will be done before the change takes place, yielding the so-called before image. For REDO purposes, the resulting page state is recorded as an after image.

Physical Transition Logging on Page Level. This logging technique is also based on pages. However, it does not explicitly record the old and new states of a page; rather it writes the difference between them to the log. The function used for computing the difference between two bit strings is
justified by the comparatively few benefits yielded by logical transition logging on the access path level. Hence logical transition logging on this level can generally be ruled out, but will become more attractive on the next higher level.

Logical Logging on the Record-Oriented Level. At one level higher, it is possible to express the changes performed by the transaction program in a very compact manner by simply recording the update DML statements with their parameters. Even if a nonprocedural query language is being used above this level, its updates will be decomposed into updates of single records or tuples equivalent to the single-record updates of procedural DB languages. Thus logging on this level means that only the INSERT, UPDATE, and DELETE operations, together with their record ids and attribute values, are written to the log. The mapping process discerns which entries are affected, which pages must be modified, etc. Thus recovery is achieved by reexecuting some of the previously processed DML statements. For UNDO recovery, of course, the inverse DML statement must be executed, that is, a DELETE to compensate an INSERT and vice versa, and an UPDATE returned to the original values. These inverse DML statements must be generated automatically as part of the regular logging activity, and for this reason this approach is not viable for network-oriented DBMSs with information-bearing interrecord relations. In such cases, it can be extremely expensive to determine, for example, the inverse for a DELETE. Details can be found in Reuter [1981].

System R is a good example of a system with logical logging on the record-oriented level. All update operations performed on the tuples are represented by one generalized modification operator, which is not explicitly recorded. This operator changes a tuple identified by its tuple identifier (TID) from an old value to a new one, both of which are recorded. Inserting a tuple entails modifying its initial null value to the given value, and deleting a tuple entails the inverse transition. Hence the log contains the information shown in Figure 7.

Logical transition logging obviously requires a materialized database that is consistent up to Level 3; that is, it can only be combined with ATOMIC propagation schemes. Although the amount of log data written is very small, recovery will be more expensive than that in other schemes, because it involves the reprocessing of some DML statements, although this can be done more cheaply than the original processing.

Table 3 is a summary of the properties of all logging techniques that we have described under two considerations: What is the cost of collecting the log data during normal processing? and, How expensive is recovery based on the respective type of log information? Of course, the entries in the table are only very rough qualitative estimations; for more detailed quantitative analysis see Reuter [1982].

Writing log information, no matter what type, is determined by two rules:

• UNDO information must be written to the log file before the corresponding updates are propagated to the materialized database. This has come to be known as the write-ahead log (WAL) principle [Gray 1978].

• REDO information must be written to the temporary and the archive log file before EOT is acknowledged to the transaction program. Once this is done, the system must be able to ensure the transaction's durability.

We return to different facets of these rules in Section 3.4.
[Figure. Transactions T1 through T4 running until a system crash, with the resulting UNDO and REDO sequences. U(Ti,X) denotes UNDO information of transaction Ti for object X; R(Ti,X) denotes REDO information of transaction Ti for object X.]
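Whether U(Ti,X) and R(Ti,X) entries are needed at all follows from the STEAL and FORCE decisions of Section 3.2.1; the following summary function (our own notation) captures the consequences described there:

```python
def logging_needs(steal: bool, force: bool) -> dict:
    """steal: dirty pages may reach the materialized database before EOT.
       force: all modified pages are propagated during EOT processing."""
    return {
        # With ¬STEAL, dirty pages never leave the buffer, so there is
        # nothing in the materialized database for UNDO to remove.
        "undo_logging": steal,
        # With FORCE, committed updates are made durable at EOT, so no
        # partial REDO is needed (global REDO still needs the archive log).
        "partial_redo_logging": not force,
    }

noundo_noredo = logging_needs(steal=False, force=True)
full_logging  = logging_needs(steal=True, force=False)
```

The first combination minimizes logging at the price of buffer space and forced writes; the second is what most large systems choose, paying for it with full UNDO and REDO logging.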
Transaction-C?~iented
some examples in this section. Checkpoint
generation involves three steps [Gray
19781: ..;,:.+j chwk[ j
to the temporary log file. ~wre 19. Scenario for transaction-oriented check-
During restart, _.the BEGIN-END_ _ points.
bracket is a clear indication as to whether a checkpoint was generated completely or interrupted by a system crash. Sometimes checkpointing is considered to be a means for restoring the whole database to some previous state. Our view, however, focuses on transaction recovery. Therefore to us a checkpoint is a technique for optimizing crash recovery rather than a definition of a distinguished state for recovery itself. In order to effectively constrain partial REDO, checkpoints must be generated at well-defined points in time. In the following sections, we shall introduce four separate criteria for determining when to start checkpoint activities.

3.3.2 Transaction-Oriented Checkpoints

As previously explained, a FORCE discipline will avoid partial REDO. All modified pages are propagated before an EOT record is written to the log, which makes the transaction durable. If this record is not found in the log after a crash, the transaction will be considered incomplete and its effects will be undone. Hence the EOT record of each transaction can be interpreted as a BEGIN-CHECKPOINT and END-CHECKPOINT, since it agrees with our definition of a checkpoint in that it limits the scope of REDO. Figure 10 illustrates transaction-oriented checkpoints (TOC). As can be seen in Figure 10, transaction-oriented checkpoints are implied by a FORCE discipline. The major drawback to this approach can be deduced from Figure 9. Hot spot pages like pi will be propagated each time they are modified by a transaction even though they remain in the buffer for a long time. The reduction of recovery expenses with the use of transaction-oriented checkpoints is accomplished by imposing some overhead on normal processing. This is discussed in more detail in Section 3.5. The cost factor of unnecessary write operations performed by a FORCE discipline is highly relevant for very large
Computing Surveys,Vol. 15, No. 4, December 1983
T. Haerder and A. Reuter
[Figure 11. Scenario for transaction-consistent checkpoints: checkpoint signal, checkpoint ci generated.]

database buffers. The longer a page remains in the buffer, the higher is the probability of multiple updates to the same page by different transactions. Thus for DBMSs supporting large applications, transaction-oriented checkpointing is not the proper choice.

3.3.3 Transaction-Consistent Checkpoints

The following transaction-consistent checkpoints (TCC) are global in that they save the work of all transactions that have modified the database. The first TCC, when successfully generated, creates a transaction-consistent database. It requires that all update activities on the database be quiescent. In other words, when the checkpoint generation is signaled by the recovery component, all incomplete update transactions are completed and new ones are not admitted. The checkpoint is actually generated when the last update is completed. After the END-CHECKPOINT record has been successfully written, normal operation is resumed. This is illustrated in Figure 11.

Checkpointing connotes propagating all modified buffer pages and writing a record to the log, which notifies the materialized database of a new transaction-consistent state, hence the name transaction-consistent checkpoint (TCC). By propagating all modified pages to the database, TCC establishes a point past which partial REDO will not operate. Since all modifications prior to the recent checkpoint are reflected in the database, REDO-log information need only be processed back to the youngest END-CHECKPOINT record found on the log. We shall see later on that the time between two subsequent checkpoints can be adjusted to minimize overall recovery costs. In Figure 11, T3 must be redone completely, whereas T4 must be rolled back. There is nothing to be done about T1 and T2, since their updates have been propagated by generating ci. Favorable as that may sound, the TCC approach is quite unrealistic for large multiuser DBMSs, with the exception of one special case, which is discussed in Section 3.4. There are two reasons for this:

• Putting the system into a quiescent state until no update transaction is active may cause an intolerable delay for incoming transactions.
• Checkpoint costs will be high in the case of large buffers, where many changed pages will have accumulated. With a buffer of 6 megabytes and a substantial number of updates, propagating the modified pages will take about 10 seconds.

For small applications and single-user systems, TCC certainly is useful.

3.3.4 Action-Consistent Checkpoints

Each transaction is considered a sequence of elementary actions affecting the database. On the record-oriented level, these actions can be seen as DML statements. Action-consistent checkpoints (ACC) can be generated when no update action is being processed. Therefore signaling an ACC means putting the system into quiescence on the action level, which impedes operation here much less than on the transaction level. A scenario is shown in Figure 12. The checkpoint itself is generated in the very same way as was described for the
Principles of Transaction-Oriented Database Recovery
[Figure 12. Scenario for action-consistent checkpoints: transactions T1-T7 around checkpoints ci-1 and ci, with a processing delay for actions, ending in a system crash.]
TCC technique. In the case of ACC, however, the END-CHECKPOINT record indicates an action-consistent rather than a transaction-consistent database. Obviously such a checkpoint imposes a limit on partial REDO. In contrast to TCC, it does not establish a boundary to global UNDO; however, it is not required by definition to do so. Recovery in the above scenario means global UNDO for T1, T2, and T3. REDO has to be performed for the last action of T5 and for all of T6. The changes of T4 and T7 are part of the materialized database because of checkpointing. So again, REDO-log information prior to the recent checkpoint is irrelevant for crash recovery. This scheme is much more realistic, since it does not cause long delays for incoming transactions. Costs of checkpointing, however, are still high when large buffers are used.

3.3.5 Fuzzy Checkpoints

In order to further reduce checkpoint costs, propagation activity at checkpoint time has to be avoided whenever possible. One way to do this is indirect checkpointing. Indirect checkpointing means that information about the buffer occupation is written to the log file rather than the pages themselves. This can be done with two or three write operations, even with very large buffers, and helps to determine which pages containing committed data were actually in the buffer at the moment of a crash. However, if there are hot spot pages, their REDO information will have to be traced back very far on the temporary log. So, although indirect checkpointing does reduce the costs of partial REDO, this does not in general make partial REDO independent of mean time between failure. Note also that this method is only applicable with ¬ATOMIC propagation. In the case of ATOMIC schemes, propagation always takes effect at one well-defined moment, which is a checkpoint; pages that have only been written (not propagated) are lost after a crash. Since this checkpointing method is concerned only with the temporary log, leaving the database as it is, we call it fuzzy. A description of a particular implementation of indirect, fuzzy checkpoints is given by Gray [1978].

The best of both worlds, low checkpoint costs with fixed limits to partial REDO, is achieved by another fuzzy scheme described by Lindsay et al. [1979]. This scheme combines ACC with indirect checkpointing: At checkpoint time the numbers of all pages (with an update indicator) currently in buffer are written to the log file. If there are no hot spot pages, nothing else

[Footnote: This means that the materialized database reflects a state produced by complete actions only; that is, it is consistent up to Level 3 at the moment of checkpointing.]
[Figure 13. Classification scheme for logging and recovery concepts: Propagation Strategy (¬ATOMIC / ATOMIC), Page Replacement (STEAL / ¬STEAL), EOT Processing (FORCE / ¬FORCE), and Checkpoint Scheme (¬ATOMIC, STEAL: TOC, TCC, ACC, fuzzy; ¬ATOMIC, ¬STEAL: TOC, TCC, fuzzy; ATOMIC, STEAL: TOC, TCC, ACC; ATOMIC, ¬STEAL: TOC, TCC), with example systems for each combination.]
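The classification tree of Figure 13 can be read as a lookup table from propagation strategy and page replacement policy to the checkpoint schemes that remain possible. The following sketch transcribes the figure's bottom row; the table name and function name are ours, not the paper's:

```python
# Sketch: the checkpoint schemes admitted by each combination of
# propagation strategy and page replacement policy, per Figure 13.
# Names (SCHEMES, possible_schemes) are illustrative only.

SCHEMES = {
    ("¬ATOMIC", "STEAL"):  ["TOC", "TCC", "ACC", "fuzzy"],
    ("¬ATOMIC", "¬STEAL"): ["TOC", "TCC", "fuzzy"],
    ("ATOMIC",  "STEAL"):  ["TOC", "TCC", "ACC"],
    ("ATOMIC",  "¬STEAL"): ["TOC", "TCC"],
}

def possible_schemes(propagation, replacement):
    """Return the checkpoint schemes compatible with the given pair."""
    return SCHEMES[(propagation, replacement)]
```

The table encodes two restrictions stated in the text: ¬STEAL does not allow for ACC, and fuzzy checkpointing is only applicable with ¬ATOMIC propagation.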
[Figure 15. Recovery scenario: system buffer contents, committed transactions (T2: B, C), and currently active transactions.]
ample of this type of implementation, based on the shadow-page mechanism in System R. This system uses action-consistent checkpointing for update propagation, and hence comes up with a consistent materialized database after a crash. More specifically, the materialized database will be consistent up to Level 4 of the mapping hierarchy and reflect the state of the most recent checkpoint; everything occurring after the most recent checkpoint will have disappeared. As discussed in Section 3.2, with an action-consistent database one can use logical transition logging based on DML statements, which System R does.

Note that in the case of ATOMIC propagation the WAL principle is bound to the propagation, that is, to the checkpoints. In other words, modified pages can be written, but not propagated, without having written an UNDO log. If the modified pages pertain to incomplete transactions, the UNDO information must be on the temporary log before the pages are propagated. The same is true for STEAL: Not only can dirty pages be written; in the case of System R they can also be propagated. Consider the scenario in Figure 17.

T1 and T2 were both incomplete at checkpoint. Since their updates (A and B) have been propagated, UNDO information must be written to the temporary log. In System R, this is done with logical transitions, as described in Section 3.2. EOT processing of T2 and T3 includes writing REDO information to the log, again using logical transitions. When the system crashes, the current database is in the state depicted in Figure 17; at restart the materialized database will reflect the most recent checkpoint state. Crash recovery involves the following actions:

• UNDO the modification of A. Owing to the STEAL policy in System R, incomplete transactions can span several checkpoints. Global UNDO must be applied to all changes of failed transactions prior to the recent checkpoint.
• REDO the last action of T2 (modification of C) and the whole transaction T3 (modification of C). Although they are committed, the corresponding page states are not yet reflected in the materialized database.
• Nothing has to be done with D since this has not yet become part of the materialized database. The same is true of T4. Since it was not present when ci was generated, it has had no effect on the materialized database.

3.5 Evaluation of Logging and Recovery Concepts

Combining all possibilities of propagating, buffer handling, and checkpointing, and
[Figure 17. Database state before the system crash and after restart. Committed: T3: C.]
[Table 4. Evaluation of logging and recovery concepts, with EOT processing (FORCE / ¬FORCE) given for each propagation and checkpoint class.]

Notes:
Abbreviations: DC, device consistent (chaotic); AC, action consistent; TC, transaction consistent.
Evaluation symbols: --, very low; -, low; +, high; ++, very high.
considering the overall properties of each scheme that we have discussed, we can derive the evaluation given in Table 4. Table 4 can be seen as a compact summary of what we have discussed up to this point. Combinations leading to inherent contradictions have been suppressed (e.g., ¬STEAL does not allow for ACC). By referring the information in Table 4 to Figure 13, one can see how existing DBMSs are rated in this qualitative comparison. Some criteria of our taxonomy divide the world of DB recovery into clearly distinct areas:

• ATOMIC propagation achieves an action- or transaction-consistent materialized database in the event of a crash. Physical as well as logical logging techniques are therefore applicable. The benefits of this property are offset by increased overhead during normal processing caused by the redundancy required for indirect page mapping. On the other hand, recovery can be cheap when ATOMIC propagation is combined with TOC schemes.
• ¬ATOMIC propagation generally results in a chaotic materialized database in the event of a crash, which makes physical logging mandatory. There is almost no overhead during normal processing, but without appropriate checkpoint schemes, recovery will be more expensive.
• All transaction-oriented and transaction-consistent schemes cause high checkpoint costs. This problem is emphasized in transaction-oriented schemes by a relatively high checkpoint frequency.

When deciding which implementation techniques to choose for database recovery, it is in general important to consider carefully whether optimizations of crash recovery put additional burdens on normal processing. If this is the case, it will certainly not pay off, since crash recovery, it is hoped, will be a rare event. Recovery components should be designed with minimal overhead for normal processing, provided that there is a fixed limit to the costs of crash recovery.

This consideration rules out schemes of the ATOMIC, FORCE, TOC type, which can be implemented and look very appealing at first sight. According to the classification, the materialized database will always be in the most recent transaction-consistent state in implementations of these schemes. Incomplete transactions have not affected the materialized database, and successful transactions have propagated indivisibly during EOT processing. However appealing the schemes may be in terms of crash recovery, the overhead during normal processing is too high to justify their use [Haerder and Reuter 1979; Reuter 1980].

There are, of course, other factors influencing the performance of a logging and recovery component: The granule of logging (pages or entries), the frequency of checkpoints (which depends on the transaction load), etc. are important. Logging is also tied to concurrency control in that the granule of logging determines the granule of locking. If page logging is applied, the DBMS must not use smaller granules of locking than pages. However, a detailed discussion of these aspects is beyond the scope of this paper; detailed analyses can be found in Chandy et al. [1975] and Reuter [1982].

4. ARCHIVE RECOVERY

Throughout this paper we have focused on crash recovery, but in general there are two types of DB recovery, as is shown in Figure 18. The first path represents the standard crash recovery, depending on the physical (and the materialized) database as well as on the temporary log. If one of these is lost or corrupted because of hardware or software failure, the second path, archive recovery, must be tried. This presupposes that the components involved have independent failure modes, for example, if temporary and archive logs are kept on different devices. The global scenario for archive recovery is shown in Figure 19; it illustrates that the component archive copy actually depends on some dynamically modified subcomponents. These subcomponents create new archive copies and update existing ones. The following is a brief sketch of some problems associated with this.

[Figure 18. Two paths of DB recovery after a failure; path (2) supplements the archive version by the latest increments.]

Creating an archive copy, that is, copying the on-line version of the database, is a very expensive process. If the copy is to be consistent, update operations on the database have to be interrupted for a long time, which is unacceptable in many applications. Archive recovery is likely to be rare, and an archive copy should not be created too frequently, both because of cost and because there is a chance that it will never be used. On the other hand, if the archive copy is very old, recovery starting from such a copy will have to redo too much work and will take too long. There are two methods to cope with this. First, the database can be copied on the fly, that is, in parallel with normal processing, without interrupting it. This will create an inconsistent copy, a so-called fuzzy dump.

The other possibility is to write only the changed pages to an incremental dump, since a new copy will be different from an old one only with respect to these pages. Either type of dump can be used to create a new, more up-to-date copy from the previous one. This is done by a separate off-line process with respect to the database and therefore does not affect DB operation. In the case of DB applications running 24 hours per day, this type of separate process is the only possible way to maintain archive recovery data. As shown in Figure 19, archive recovery in such an environment requires the most recent archive copy, the latest incremental modifications to it (if there are any), and the archive log. When recovering the database itself, there is little additional cost in creating an identical new archive copy in parallel.

There is still another problem hidden in this scenario: Since archive copies are needed very infrequently, they may be susceptible to magnetic decay. For this reason several generations of the archive copy are usually kept. If the most recent one does not work, its predecessor can be tried, and so on. This leads to the consequences illustrated in Figure 20.
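The incremental-dump idea can be sketched as follows. The data structures (databases and dumps as page-to-content dictionaries) and function names are our own simplifications, not any system's actual format:

```python
# Sketch of incremental dumping: only pages changed since the last dump are
# written out, and a separate off-line process merges them into a newer
# archive copy. Structures are illustrative simplifications.

def take_incremental_dump(database, dumped_state):
    """Collect only the pages that differ from the state already dumped."""
    return {page: content for page, content in database.items()
            if dumped_state.get(page) != content}

def merge_dump(old_copy, incremental_dump):
    """Off-line process: create a more up-to-date archive copy
    from the previous one plus the changed pages."""
    new_copy = dict(old_copy)
    new_copy.update(incremental_dump)
    return new_copy
```

Since the merge touches only the old copy and the dump, it can run without access to the on-line database, which is the point made above about 24-hour operation.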
[Figure 20. Generations of the archive copy: generation n-2, generation n-1, generation n, together with the archive log.]

[Figure 21. Two possibilities for duplicating the archive log: (a) two archive logs, which must be synchronized during Phase 1 of EOT; (b) an archive copy with a separately maintained archive log.]
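One of the two possibilities keeps all log writes on the temporary log during normal processing and lets an independent process copy REDO data to the archive log asynchronously; archive recovery then combines the archive log with the most recent tail of the temporary log. The sketch below illustrates that scheme; the class and method names are our own inventions:

```python
# Sketch of asynchronous archive logging: records are written synchronously
# only to the temporary log; a copier process moves them to the archive log
# later. Names (LogArchiver, run_copier, ...) are illustrative only.

class LogArchiver:
    def __init__(self):
        self.temporary_log = []   # written synchronously at EOT
        self.archive_log = []     # fed asynchronously by the copier
        self.copied = 0           # position up to which records were copied

    def write_at_eot(self, record):
        # Only one log (plus its duplicate) must be synchronized at EOT.
        self.temporary_log.append(record)

    def run_copier(self):
        """Asynchronous step: move new REDO data to the archive log."""
        self.archive_log.extend(self.temporary_log[self.copied:])
        self.copied = len(self.temporary_log)

    def archive_recovery_input(self):
        """Most entries come from the archive log; the temporary log
        supplies only the records written after the last copier run."""
        return self.archive_log + self.temporary_log[self.copied:]
```

Because the temporary log is now also needed for archive recovery, it loses its failure independence, which is why the text argues it must itself be duplicated in this scheme.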
We must anticipate the case of starting archive recovery from the oldest generation, and hence the archive log must span the whole distance back to this point in time. That makes the log susceptible to magnetic decay as well, but in this case generations will not help; rather we have to duplicate the entire archive log file. Without taking storage costs into account, this has severe impact on normal DB processing, as is shown in Figure 21.

Figure 21a shows the straightforward solution: two archive log files that are kept on different devices. If this scheme is to work, all three log files must be in the same state at any point in time. In other words, writing to these files must be synchronized at each EOT. This adds substantial costs to normal processing and particularly affects transaction response times. The solution in Figure 21b assumes that all log information is written only to the temporary log during normal processing. An independent process that runs asynchronously then copies the REDO data to the archive log. Hence archive recovery finds most of the log entries in the archive log, but the temporary log is required for the most recent information. In such an environment, temporary and archive logs are no longer independent from a recovery perspective, and so we must make the temporary log very reliable by duplicating it. The resulting scenario looks much more complicated than the first one, but in fact the only additional costs are those for temporary log storage, which are usually small. The advantage here is that only two files have to be synchronized during EOT, and moreover, as numerical analysis shows, this environment is more reliable than the first one by a factor of 2.

These arguments do not, of course, exhaust the problem of archive recovery. Applications demanding very high availability and fast recovery from a media failure will use additional measures such as duplexing the whole database and all the hardware (e.g., see TANDEM [N.d.]). This aspect of database recovery does not add anything conceptually to the recovery taxonomy established in this paper.

5. CONCLUSION

We have presented a taxonomy for classifying the implementation techniques for database recovery. It is based on four criteria:

Propagation. We have shown that update propagation should be carefully distinguished from the write operation. The ATOMIC/¬ATOMIC dichotomy defines two different methods of handling low-level updates of the database, and also gives rise to different views of the database, both the materialized and the physical database. This proves to be useful in defining different crash states of a database.

Buffer Handling. We have shown that interfering with buffer replacement can support UNDO recovery. The STEAL/¬STEAL criterion deals with this concept.

EOT Processing. By distinguishing FORCE policies from ¬FORCE policies we can distinguish whether successful transactions will have to be redone after a crash. It can also be shown that this criterion heavily influences the DBMS performance during normal operation.

Checkpointing. Checkpoints have been introduced as a means for limiting the costs of partial REDO during crash recovery. They can be classified with regard to the events triggering checkpoint generation and the amount of data written at a checkpoint. We have shown that each class has some particular performance characteristics.

Some existing DBMSs and implementation concepts have been classified and described according to the taxonomy. Since the criteria are relatively simple, each system can easily be assigned to the appropriate node of the classification tree. This classification is more than an ordering scheme for concepts: Once the parameters of a system are known, it is possible to draw important conclusions as to the behavior and performance of the recovery component.

ACKNOWLEDGMENTS

We would like to thank Jim Gray (TANDEM Computers, Inc.) for his detailed proposals concerning the structure and contents of this paper, and his enlightening discussions of logging and recovery. Thanks are also due to our colleagues Flaviu Cristian, Shel Finkelstein, C. Mohan, Kurt Shoens, and Irv Traiger (IBM Research Laboratory) for their encouraging comments and critical remarks.

REFERENCES

ASTRAHAN, M. M., BLASGEN, M. W., CHAMBERLIN, D. D., GRAY, J. N., KING, W. F., LINDSAY, B. G.,
Received January 1980; Revised May 1982; final revision accepted January 1984