Algorithms for Chemical Computations (ACS Symposium Series No. 46)

Algorithms for Chemical Computations
Ralph E. Christoffersen, The University of Kansas

EDITOR
Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0046.fw001
A symposium sponsored by the Division of Computers in Chemistry at the 171st Meeting of the American Chemical Society, New York, N.Y., Aug. 30, 1976.
ACS SYMPOSIUM SERIES 46
AMERICAN CHEMICAL SOCIETY
WASHINGTON, D. C. 1977
In Algorithms for Chemical Computations; Christoffersen, R.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
Library of Congress CIP Data
Algorithms for chemical computations. (ACS symposium series; 46 ISSN 0097-6156) Includes bibliographical references and index. 1. Chemistry--Data processing--Congresses. 2. Algorithms--Congresses.
I. Christoffersen, Ralph E., 1937- . II. American Chemical Society. Division of Computers in Chemistry. III. Series: American Chemical Society. ACS symposium series; 46. QD39.3.E46A43 ISBN 0-8412-0371-7 540'.28'5 77-5030 ACSMC8 46 1-151
Copyright 1977 American Chemical Society All Rights Reserved. No part of this book may be reproduced or transmitted in any form or by any means (graphic, electronic, including photocopying, recording, taping, or information storage and retrieval systems) without written permission from the American Chemical Society.
PRINTED IN THE UNITED STATES OF AMERICA
ACS Symposium Series

Robert F. Gould, Editor
Advisory Board
Donald G. Crosby Jeremiah P. Freeman E. Desmond Goddard Robert A. Hofstader John L. Margrave Nina I. McClelland John B. Pfeiffer Joseph V. Rodricks Alan C. Sartorelli Raymond B. Seymour Roy L. Whistler Aaron Wold
FOREWORD
The ACS SYMPOSIUM SERIES was founded in 1974 to provide a medium for publishing symposia quickly in book form. The format of the SERIES parallels that of the continuing ADVANCES IN CHEMISTRY SERIES except that in order to save time the papers are not typeset but are reproduced as they are submitted by the authors in camera-ready form. As a further means of saving time, the papers are not edited or reviewed except by the symposium chairman, who becomes editor of the book. Papers published in the ACS SYMPOSIUM SERIES are original contributions not published elsewhere in whole or major part and include reports of research as well as reviews since symposia may embrace both types of presentation.
PREFACE
As computing hardware and software continue to pervade the various areas of chemical research, education, and technology, various important developments begin to emerge. For example, for areas in which large "number crunching" is required, larger and faster computing systems have been developed that incorporate parallel processing, which has provided substantial increases in speed of problem solving compared with sequential processing. In other areas, such as data acquisition and equipment control, minicomputers and "midicomputers" have been designed and built to provide substantial improvements in both the quality of the data collected and the implementation of new experiments that could not be performed without the computer system assistance. Equally important developments in software have also evolved, from the implementation of convenient timesharing systems for program development to the development of a variety of application program "packages" for use in various chemical research areas. While the limits achievable through better hardware design or more efficient programming of available algorithms are far from being reached, it is now becoming apparent that the algorithms themselves may present both substantial difficulties and opportunities for significant progress. In other words, it may no longer be a feasible strategy to assume that either a faster computer or a more efficiently programmed existing algorithm will be adequate in solving a given problem. To focus more clearly on this emerging area of importance, a symposium was organized as a part of the Fall American Chemical Society Meeting in San Francisco, on August 30, 1976. The goal was to bring together several experts in the development of algorithms for chemical research so that the state of the art might be assessed.
These persons, whose papers are included in this volume, discussed not only the significant developments in algorithms that have already occurred, but also indicated places where currently available algorithms were not adequate. While it is not possible in a single symposium to discuss the entire spectrum of areas where significant algorithmic development has occurred or is needed, an attempt was made to include several of the important areas where progress is evident. In particular, the papers in this volume include discussions of the use of graph theory in algorithm design, algorithm design and choice in quantum chemistry, molecular scattering, solid state description and pattern recognition, and the handling of chemical information. As both the authors and the topics indicate, the general topic is extremely diverse in scope, involving expertise from several disciplines in the search for new and improved algorithms. While this area is currently in its infancy, its potential impact is great, and it is hoped that these papers will serve both as a reference to the current state of the art and as an impetus to extend the study of algorithmic development to other areas as well.
RALPH E. CHRISTOFFERSEN
The University of Kansas
Lawrence, Kansas
December 1976
Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0046.pr001
Graph Algorithms in Chemical Computation
ROBERT ENDRE TARJAN* Computer Science Dept., Stanford University, Stanford, CA 94305
Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0046.ch001
1. Introduction.
The use of computers in science is widespread. Without powerful number-crunching facilities at his** disposal, the modern scientist would be greatly handicapped, unable to perform the thousands or millions of calculations required to analyze his data or explore the implications of his favorite theory. He (or his assistant) thus requires at least some familiarity with computers, the programming of computers, and the methods which might be used by computers to solve his problems. An entire branch of mathematics, numerical analysis, exists to analyze the behavior of numerical algorithms.
However, the typical scientist's appreciation of the computer may be too narrow. Computers are much more than fast adders and multipliers; they are symbol manipulators of a very general kind. A scientist who writes programs in FORTRAN or some similar, scientifically oriented computer language may be unaware of the potential use of computers to solve computational, but not necessarily numeric, problems which might arise in his research.
This paper discusses the use of computers to solve nonnumeric problems in chemistry. I shall focus on a particular problem, that of identifying chemical structure, and examine computer methods for solving it. The discussion will include elements of graph theory, list processing, analysis of algorithms, and computational complexity. I write as a computer scientist, not as a chemist; I shall neglect details of chemistry in order to focus on issues of algorithmic applicability, simplicity, and speed. It is my hope that some readers of this paper will become interested in applying to their own problems in chemistry the methods developed in recent years by computer scientists and mathematicians.
The paper is divided into several sections. Section 2 discusses representation of chemical molecules as graphs. Section 3 covers complexity measures for computer algorithms. Section 4 surveys what is known about the structure identification problem in general. Section 5 solves the problem for molecules without rings. Section 6 gives a method for analyzing a molecule by systematically breaking it into smaller parts. Section 7 discusses the case of "planar" molecules. Section 8 outlines a complete method for structure identification, and mentions some further applications of the ideas contained herein to chemistry.

* This research was partially supported by the National Science Foundation, grant MCS75-22870, and by the Office of Naval Research, contract N00014-76-C-0688.
** For the purpose of smooth reading, I have used the masculine gender throughout this paper.

2. Molecules and Their Representation.
Consider a hypothetical chemical information system which performs the following tasks. If a chemist asks the system about a certain molecule, the system will respond with the information it has concerning that molecule. If the chemist asks for a listing of all molecules which satisfy certain properties (such as containing certain radicals), the system will respond with all such molecules known to it. If the chemist asks for a listing of possible molecules (known or not) which satisfy certain properties, the system will provide a list. Such an information system must be able to identify molecules on the basis of their structure. Given a molecule, the system must derive a unique code for the molecule, so that the code can be looked up in a table and the properties of the molecule located. It is this coding or cataloging problem which I want to consider here. A number of codes for molecules have been proposed and used; e.g. see (1,2,3,4). The existence of many different codes with no single standard suggests the importance and the difficulty of the problem. I shall attempt to explain why the problem is difficult, and to suggest some computer approaches to it.
To deal with the problem in a rigorous fashion, we couch it within the branch of mathematics called graph theory. A graph G = (V, E) is a finite collection V of vertices and a finite collection E of edges. Each edge (v,w) consists of an unordered pair of distinct vertices. Each edge and each vertex may in addition have a label specifying certain information about it. We represent a chemical molecule as a graph by constructing one vertex for each atom and one edge for each chemical bond; a ball-and-stick model of a molecule is really a graph representation of it. We label each vertex with the type of atom it represents. See Figure 1 for an example. Two vertices v and w of a graph are said to be adjacent if (v,w) is an edge of the graph. If (v,w) is an edge, and v is a vertex contained in it, the edge and vertex are said to be incident. Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are
said to be isomorphic if their vertices can be identified in a one-to-one fashion so that, if $v_1$ and $w_1$ are vertices in $G_1$ and $v_2$ and $w_2$ are the corresponding vertices in $G_2$, then $(v_1, w_1)$ is an edge of $G_1$ if and only if $(v_2, w_2)$ is an edge of $G_2$. Furthermore the pairs $v_1, v_2$; $w_1, w_2$; and $(v_1, w_1), (v_2, w_2)$ must have the same labels if the graphs are
labelled. The problem we shall consider is this: given two graphs, determine if they are isomorphic. Or: given a graph, construct a code for it such that two graphs have the same code if and only if they are isomorphic. Notice that this mathematical abstraction of chemical structure identification neglects some details of chemistry. For instance, we allow bonds between only two atoms, thereby precluding the representation of resonance structures, and we ignore issues of stereochemistry (if two bonds of a carbon atom are fixed, our model allows free interchanging of the other two, whereas in the real world such interchanging may produce stereoisomers; see Figure 2). However, these are differences of detail only, which can easily be incorporated into the model; we neglect them only for simplicity. Note also that our model does not allow loops (edges of the form (v,v)), but it does allow multiple edges (which may be used to represent multiple bonds, or for other purposes).
A generalization of the isomorphism problem is the subgraph isomorphism problem. Given two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, we say $G_1$ is a subgraph of $G_2$ if $V_1$ is a subset of $V_2$ and $E_1$ is a subset of $E_2$. The subgraph isomorphism problem is that of determining if a given graph $G_1$ is isomorphic to a subgraph of another given graph $G_2$. This is one of the problems our hypothetical information system must solve to provide a list of molecules containing certain radicals. We shall deal with this problem briefly; it seems to be much harder than the isomorphism problem.

Figure 1. Graphic representation of benzene
Figure 2. Stereoisomers

If a computer is to efficiently encode molecules it must first have a way to represent a molecule, or a graph. We consider
two standard ways to represent graphs in a computer. The first is by an adjacency matrix. If G = (V, E) is a graph with vertices numbered from 1 to n, an adjacency matrix for G is the n by n matrix $M = (m_{ij})$ with elements 0 and 1, such that $m_{ij} = 1$ if $(v_i, v_j)$ is an edge of G and $m_{ij} = 0$ otherwise. See Figure 3(a), (b). Note that M is symmetric and that its main diagonal is zero. The matrix M is not a code for G since it is not unique; it depends upon the vertex numbering. An adjacency matrix representation of a graph has several nice properties. Many natural graph operations correspond to standard matrix operations (see (5) for some examples). The bits of M can be packed in groups into computer words, so that storage of M requires only $n^2/w$ words, if w is the word length of the machine (or only $n^2/2w$ words, if advantage is taken of the symmetry of M). If M is packed into words in this way, the bits can be processed w at a time, at least in certain kinds of computations.
However, the matrix representation has some serious disadvantages. An important property of graphs representing chemical molecules is that they are sparse; most of the potential edges are missing. Since each atom has a fixed, small valence, the number of edges in a graph representing a molecule is no more than some fixed constant times n, the number of vertices. However, in an arbitrary graph the number of edges can be as large as $(n^2 - n)/2$ (or larger, if there are multiple edges). An adjacency matrix for a sparse graph contains mostly zeros, but there is no good way of exploiting this fact. It has been proved that testing many graph properties, including isomorphism, requires examining some fixed fraction of the elements of the adjacency matrix in the worst case (6). Any algorithm which uses a matrix representation of a graph thus runs in time proportional to at least $n^2$ in the worst case. If we wish to deal with large graphs and hope to get a running time close to linear in the size of the graph, we must use a different representation. The one we choose is an adjacency structure.
An adjacency structure for a graph G = (V, E) is a set of lists, one for each vertex. The list for vertex v contains all vertices adjacent to v. Note that a given edge (v,w) is represented twice; w appears in the adjacency list for v and v appears in the adjacency list for w. See Figure 3(c). An adjacency structure is surprisingly easy to define and manipulate in FORTRAN or any other standard programming language. We use three arrays, which we may call adjacent to, vertex, and next. For any vertex v, the element $e_1$ = adjacent to(v) represents the first element on the adjacency list for vertex v. The corresponding vertex is vertex($e_1$), and the element
Figure 3. Graphic representations: (a) graph, (b) adjacency matrix, (c) adjacency structure, and (d) array representation of adjacency structure
$e_2$ = next($e_1$) represents the next element on the list. A null element indicates the end of the list. See Figure 3(d). The total amount of storage required by these arrays is n+4m, where n is the number of vertices in the graph and m is the number of edges; the total storage is thus linear in the size of the graph. Searches and other natural graph operations are easy to implement using such a data structure; e.g. see (7, 8). If the graph is labelled we can use two extra arrays which give vertex and edge labels. Although the matrix representation of a graph is simple and mathematically elegant, the adjacency structure representation seems to be much more useful for computers.
3. Notions of Complexity.
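Before any running times are measured, it may help to see the three-array adjacency structure of Section 2 in concrete form. The following is a minimal sketch in Python rather than FORTRAN. The names `build_adjacency_structure` and `neighbours`, the use of 0 as the null element, and the choice to push each new cell onto the front of its list are illustrative assumptions, not details taken from the paper.

```python
def build_adjacency_structure(n, edges):
    """Build the three arrays for a graph on vertices 1..n.

    adjacent_to[v] is the index of the first list cell for vertex v,
    vertex[e] is the neighbour stored in cell e, and next_cell[e] is the
    following cell on the same list. Index 0 is unused in every array so
    that 0 can serve as the null element (an assumption of this sketch).
    """
    adjacent_to = [0] * (n + 1)
    vertex = [0]
    next_cell = [0]
    for v, w in edges:
        # Each undirected edge (v, w) is stored twice:
        # w goes on the list for v, and v goes on the list for w.
        for a, b in ((v, w), (w, v)):
            vertex.append(b)
            next_cell.append(adjacent_to[a])  # push onto the front of a's list
            adjacent_to[a] = len(vertex) - 1
    return adjacent_to, vertex, next_cell


def neighbours(v, adjacent_to, vertex, next_cell):
    """Walk the list for vertex v, yielding each adjacent vertex."""
    e = adjacent_to[v]
    while e != 0:          # 0 marks the end of the list
        yield vertex[e]
        e = next_cell[e]


# A six-membered ring, used here only as a small illustrative graph.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
adjacent_to, vertex, next_cell = build_adjacency_structure(6, ring)
print(sorted(neighbours(1, adjacent_to, vertex, next_cell)))
```

In this layout the head array holds n entries and the other two arrays hold 2m entries each, so the storage grows linearly with the size of the graph.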
If we are to discuss computer methods, we need some way of measuring the performance of an algorithm. We would like our code for molecules to be simple, natural, and easy to compute. Concepts like "simple" and "natural", although very important in any real-world cataloguing system, are difficult to define and quantify. We shall use a measure based on a machine's point of view, rather than on a human's. Though an algorithm good by such a measure may be unwieldy for human use, at best a method useful for machines will also be useful for people. At worst, such a measure provides a firm base for discussion of the merits of various methods.
One possible measure of algorithmic complexity is program size. Such a measure is related to the inherent simplicity or complexity of a method. This measure is static; it is independent of the size or structure of the particular input data. Some other possible measures are dynamic; they measure the amount of a resource used by the method as a function of the size of the input data. Typical dynamic measures are running time and storage space. Program size as a measure has the disadvantage that in many cases the simplest algorithm is a brute force examination of all possibilities; the running time of such an algorithm is exponential in the size of the input and thus only very small graphs can be analyzed. The algorithms we shall consider all use storage space linear or quadratic in the number of vertices in the input graph; thus storage space as a measure does not discriminate finely enough for our purposes.
The running time of an algorithm is strongly related to the algorithm's usefulness if it is run many times. We therefore choose running time as a function of input size as our measure of complexity. How shall we measure running time? One possibility is to run the program several times on various sets of input data and extrapolate. This approach is very dangerous. If the number of examples tried is too small, the extrapolation is probably meaningless. If the number of examples tried is large and drawn
from a suitably defined random population, the extrapolation may be statistically meaningful. However, defining a random graph in a way which is realistic for chemistry is a very tricky problem. Furthermore any statistical method may miss rare but very bad cases; we would not like our cataloguing system to spend hours on an occasional bizarre molecule. We are therefore only satisfied with a careful theoretical analysis of an algorithm leading to a worst-case bound on its running time. To account for variability in machines, we ignore constant factors and pay attention only to the asymptotic growth rate of the running time as a function of the size of the problem graph. Our measure is thus machine independent and most valid for large graphs. If machine-dependent constant factors and running time on small graphs are of interest, computer experiments or a more detailed analysis must be used. For convenience, we shall use the notation "f(n) is O(g(n))" to denote that the function f(n) satisfies $f(n) \le c\,g(n)$ for some positive constant c and all n, where f and g are non-negative functions of n.
4. Isomorphism and Subgraph Isomorphism.
The isomorphism problem for general graphs is not an easy one. Given two graphs $G_1$ and $G_2$ of n vertices, the number of possible one-to-one mappings of vertices is n!, and a brute force approach, which tries all the possibilities, is too time-consuming except for small graphs. A backtracking search (9) fares somewhat better. Initially, one vertex from each graph is chosen, and these vertices are matched. In general, some vertex $w_1$ adjacent to an already-matched vertex $v_1$ in $G_1$ is chosen and matched with some vertex $w_2$ adjacent to the vertex $v_2$ in $G_2$ previously matched to $v_1$. Then $w_1$ and $w_2$ are compared to make sure their adjacencies with already-matched vertices are consistent. If so, a new vertex for matching is chosen. If not, the last matched pair is unmatched and a new matching tried. The process continues until either all vertices are matched or there is found to be no way of matching the vertex sets of the two graphs.
Backtrack search saves time over the brute force method by abandoning an attempt at matching as soon as it is known to fail. The running time of backtrack search depends in a complicated way upon the structure of the graph; the best we can say in general is that if d is the maximum valence (number of vertices adjacent to a given vertex) in either graph, the maximum running time of backtrack search is $O((d-1)^n)$, still exponential, but better than brute force.
The most successful algorithms for general graph isomorphism use the backtrack approach (as a fall-back method) in combination with a partitioning method (10,11,12,13). The idea is to partition the combined vertex sets of the two graphs so that any isomorphic mapping between the graphs preserves the partitioning. The method has four main steps.
1. Choose an initial partition of the vertex sets.
2. Refine the partition. If any subset of the partition contains more vertices from one graph than from the other, go to step 4.
3. If each subset of the partition contains a single vertex from each graph, try the implied matching to see if it gives an isomorphism. If it does, halt with the isomorphism; if not, go to step 4. If some subset contains two or more vertices from one graph, choose a vertex in this subset from each graph, match these vertices, and go to step 2 (the new matching allows further refinement of the partition).
4. Backtrack. Back up to the partition existing when the last match was made. Try a new match and go to step 2. If all matches have been tried, back up to the previous match. If all possibilities for the very first match have been tried, halt. The graphs are not isomorphic.
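The plain backtracking search described before the four-step method can be sketched as follows. This is a hedged illustration, not the implementation of reference (9): the function name `isomorphism`, the dict-of-sets graph format, the restriction to simple connected graphs (no multiple edges), and the breadth-first choice of the next vertex to match are all assumptions made for the example.

```python
def isomorphism(adj1, adj2, label1=None, label2=None):
    """Return a dict mapping vertices of graph 1 onto graph 2, or None.

    adj1 and adj2 map each vertex to the set of its neighbours.
    Graph 1 is assumed to be connected.
    """
    if len(adj1) != len(adj2):
        return None
    if not adj1:
        return {}
    label1 = label1 or {v: None for v in adj1}
    label2 = label2 or {v: None for v in adj2}

    # Order graph 1 so that every vertex after the first is adjacent to an
    # earlier one; parent[v] records that earlier, already-matched vertex.
    start = next(iter(adj1))
    order, parent = [start], {start: None}
    for v in order:                      # the list grows while we scan it
        for w in sorted(adj1[v]):
            if w not in parent:
                parent[w] = v
                order.append(w)
    if len(order) != len(adj1):
        raise ValueError("graph 1 is not connected")

    mapping, used = {}, set()

    def consistent(v, w):
        # Labels and valences must agree, and adjacencies with all
        # already-matched vertices must be the same in both graphs.
        if label1[v] != label2[w] or len(adj1[v]) != len(adj2[w]):
            return False
        return all((u in adj1[v]) == (mapping[u] in adj2[w]) for u in mapping)

    def extend(i):
        if i == len(order):
            return True                  # every vertex is matched
        v = order[i]
        p = parent[v]
        # Candidates: any vertex for the first match, otherwise the
        # neighbours of the vertex already matched to v's parent.
        candidates = adj2 if p is None else adj2[mapping[p]]
        for w in sorted(candidates):
            if w not in used and consistent(v, w):
                mapping[v] = w
                used.add(w)
                if extend(i + 1):
                    return True
                del mapping[v]           # unmatch the last pair, try another
                used.remove(w)
        return False

    return dict(mapping) if extend(0) else None
```

A six-membered ring and two disjoint triangles have the same number of vertices and the same valences, so they make a small case on which the search has to exhaust its choices before reporting that no isomorphism exists.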
For the initial partition we divide vertices up according to their labels and their valences. Other more elaborate partitionings are possible; see (14,15). We carry out the refinement, step 2, in the following way. For each vertex, we determine the number of adjacent vertices in each subset of the partition. This information itself partitions the vertices. We take the intersection of this partition with the old partition as our new partition. We repeat this refining step until no further refinement takes place. Implementation of the repeated refinement step is somewhat tricky; Hopcroft (16) has provided a good implementation. The effect of matching two vertices in step 3 is to place them by themselves in a new subset of the partition. Thus step 3 guarantees refinement of the partition. See Figure 4 for an example of the application of the algorithm.
The idea behind this algorithm is to use all possible local means of distinguishing between vertices before guessing a match. The method seems to work quite well in practice. It is possible that some version of this partitioning method has a time bound which is a polynomial function of n. (To prove this requires showing that the amount of backtracking is polynomial in n; the refinement step requires only O(m log m) time, where m is the number of edges, if Hopcroft's implementation is used.) However, the present theoretical bounds on the algorithm are no better than those for backtrack search.
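The refinement step (step 2) can be sketched naively as below. This is not Hopcroft's O(m log m) implementation; it simply recomputes every vertex's neighbour counts on each pass. The names `refine` and `may_be_isomorphic`, the tagging of vertices as (graph number, vertex), and the use of valence alone for the initial partition are assumptions of this sketch.

```python
def refine(adj, initial):
    """Refine a partition of the vertices until no subset splits further.

    adj maps each vertex to an iterable of its neighbours; initial maps
    each vertex to a key naming its starting subset. Returns a dict from
    vertex to an integer subset number.
    """
    ids = {k: i for i, k in enumerate(sorted(set(initial.values()), key=repr))}
    cls = {v: ids[initial[v]] for v in adj}
    while True:
        signature = {}
        for v in adj:
            counts = {}
            for w in adj[v]:
                counts[cls[w]] = counts.get(cls[w], 0) + 1
            # Intersect the old partition with the partition induced by the
            # number of neighbours lying in each subset.
            signature[v] = (cls[v], tuple(sorted(counts.items())))
        ids = {s: i for i, s in enumerate(sorted(set(signature.values())))}
        if len(ids) == len(set(cls.values())):
            return cls                   # stable: no subset was split
        cls = {v: ids[signature[v]] for v in adj}


def may_be_isomorphic(adj1, adj2):
    """Necessary (not sufficient) test based on refinement alone.

    Refines the combined vertex set of both graphs and checks that every
    subset holds equally many vertices from each graph.
    """
    adj = {(1, v): [(1, w) for w in adj1[v]] for v in adj1}
    adj.update({(2, v): [(2, w) for w in adj2[v]] for v in adj2})
    cls = refine(adj, {v: len(adj[v]) for v in adj})  # start from valences
    balance = {}
    for (graph, _), c in cls.items():
        balance[c] = balance.get(c, 0) + (1 if graph == 1 else -1)
    return all(b == 0 for b in balance.values())
```

When this test returns False the graphs cannot be isomorphic. When it returns True, refinement alone has not separated them (for example, on two graphs in which every vertex has the same valence), and a match must be guessed as in step 3.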
It is a major open question whether a polynomial-time algorithm exists for the general graph isomorphism problem.

Figure 4. Isomorphism test by partitioning: (a) graphs, (b) initial partition, (c) initial match 1-7, (d) first refinement, and (e) further refinement (match fails since F contains no vertices of second graph). Complete test requires matching 1 successively to 8,9,10,11,12, failing each time.
Figure 5. A tree

The situation for the subgraph isomorphism problem is somewhat better understood and somewhat more gloomy. It is possible to generalize the partitioning algorithm described above so that it solves the subgraph isomorphism problem (17). However, the results of this method in practice seem to be mixed. Furthermore it has been proved that the subgraph isomorphism problem belongs to a class of problems called NP-complete. The NP-complete problems include a number of well-studied, apparently hard problems such as the travelling salesman problem of operations research, the tautology problem of propositional calculus, and many other combinatorial problems. The NP-complete problems have the property that if any one of them has a polynomial-time algorithm, they all do. Since no one has discovered a polynomial-time algorithm for any of these problems, though many people have tried, it seems likely that none of these problems is solvable in polynomial time. It is not known whether the graph isomorphism problem itself is NP-complete. For a discussion of NP-complete problems, see (18,19,20).
It would seem that our attempt to solve the graph isomorphism problem with a provably good algorithm is doomed to failure, and that we must be satisfied with a heuristic; that is, with a method which seems to work well in many cases for reasons which we do not understand. However, by lowering our sights somewhat, we can go a long way toward a solution which is both practical and theoretically efficient. We shall first consider the isomorphism problem for trees. For such graphs, there is a good isomorphism algorithm. Next, we study a decomposition method for representing a graph as a collection of smaller graphs joined in a tree-like fashion.
We then examine the important special case of planar graphs. Finally, we combine these ideas to produce an isomorphism algorithm which is very fast on planar graphs and is likely to work well on most, if not all, chemical molecules.
5. Codes for Trees.
Let G = (V, E) be a directed graph. A simple path from a vertex $v_1$ to a vertex $v_k$ in G is a sequence of distinct edges $(v_1, v_2), (v_2, v_3), \ldots, (v_{k-1}, v_k)$. The length of the path is k-1, the number of edges it contains. A cycle is a simple path from a vertex v to itself. A graph is connected if every pair of vertices is joined by a path. In the description of a backtrack search in Section 4 we implicitly assumed that the graphs of interest were connected; we shall continue to make this assumption. A tree is a connected graph with no cycles (see Figure 5 for an example). In contrast to the isomorphism problem for general graphs, the isomorphism problem for trees is relatively easy. Any tree with n vertices has exactly n-1 edges. We shall describe an algorithm for constructing, in O(n) time, a code for any tree, such that two trees are isomorphic if and only if they have
identical codes. Variants of the algorithm have appeared in many places (21,22,23,24) and it has in fact been used in chemical computation (25). To extract a unique code for a tree we must first put the tree into a canonical form. The first step in doing this is to find a uniquely determined vertex or edge in the tree. A tree has at least two vertices of valence one. We call such vertices leaves. For a given vertex v, let the height h(v) of v be the length of the longest path from v to a leaf. A tree contains either a unique vertex of largest height, or two adjacent vertices of largest height (26). Since height must be preserved under isomorphism, this unique vertex or pair of vertices can be used as a starting point for construction of the canonical tree. If there are two vertices of largest height, we add a new vertex in the middle of the edge joining them and label it as a dummy vertex. Then we can assume our tree always has a unique vertex of largest height, which we call the root. Each vertex v except the root has a unique parent u which is adjacent to v and satisfies h(u) ≥ h(v)+1. All other vertices w adjacent to v are called its children and satisfy h(w) ≤ h(v)-1. We define ancestors and descendants in the obvious way. Each vertex v in the tree defines a subtree consisting of v and its descendants (see Figure 6). We define a total ordering on trees with vertex labels by the following rules.
(1) If T and U are two trees with different labels on their roots, order the trees according to the labels of the roots.
(2) If T and U are two trees with the same label on their roots, let T_1, T_2, ..., T_k be the subtrees defined by the children of the root of T (in increasing order) and let U_1, U_2, ..., U_l be the subtrees defined by the children of the root of U. If there is some index j such that T_i is isomorphic to U_i for i < j and T_j is less than U_j, or if T_i is isomorphic to U_i for 1 ≤ i ≤ k and k < l, then T is defined to be less than U.
That is, to compare two trees, we first compare their root labels. If these are identical, we order the subtrees defined by the children of the roots, and compare the ordered sequences of subtrees lexicographically.
Using this ordering, we can construct a canonical representation of a given tree by reordering the children of each vertex according to the order defined above. See Figure 6. From this canonical representation, we can construct a linear code which represents the tree uniquely. There are many possible ways to do this; one way is defined by the following rules.
(1) The code code(T) for a tree T consisting of a single vertex is its label.
(2) If T is a tree of more than one vertex, and T_1, ..., T_k are the subtrees defined by the children of the root of T (in order), then the code for T is code(T) = code(root)(code(T_1)code(T_2) ... code(T_k)).
For instance, the code for the molecule in Figure 6 is C(C(ClHH)C(HHH)C(HHH)O(H)). This method gives a unique code for each tree; two trees are isomorphic if and only if they have the same code (we have neglected to include edge labels in the code, but it is easy to do so if necessary). The code is quite natural, and it is easy to reconstruct a tree given its code. The reordering of subtrees is what guarantees that each tree has only one code. One can vary the exact definition of the ordering; what is important is that the subtrees be ordered somehow. When this algorithm is applied to chemical molecules, it is useful to use abbreviations in the code, such as omitting explicit reference to hydrogen atoms; e.g. see (27).
Implementing the reordering algorithm is somewhat complicated, since the sorting requires comparison of sequences element-by-element. See (28) for a good implementation. Constructing the code for a tree of n vertices requires O(n) time with this implementation. We can expect to find no faster algorithm, since any method must inspect the entire tree.
On trees, not only is the isomorphism problem efficiently solvable, but so is the subgraph isomorphism problem. Edmonds and Matula (29) have discovered an algorithm which will determine whether one tree is isomorphic to a subtree of another in O(n^(5/2)) time, where n is the number of vertices in the larger tree. This bound can be improved substantially if the valence of all vertices is bounded by a small constant. The algorithm may be of practical value, but this has yet to be tested.
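The coding rules of this section can be sketched in a few lines of Python. This is a minimal illustration and not the linear-time implementation of (28): it finds the central vertex (or pair) by stripping leaves layer by layer, and it orders subtrees simply by sorting their code strings, which is one of the permitted variations of the ordering. The names tree_code, figure6_molecule and the dummy label "*" are assumptions made for the example, and the atom numbering of the molecule is arbitrary.

```python
DUMMY = "*"  # assumed label for the dummy root placed between two central vertices


def tree_code(labels, edges):
    """Return a canonical string code for a labelled tree.

    labels: dict mapping vertex -> label string (non-empty)
    edges:  iterable of (u, v) pairs; the graph is assumed to be a tree
    """
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Strip leaves layer by layer until one or two central vertices remain.
    degree = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    layer = [v for v in adj if degree[v] <= 1]
    while len(remaining) > 2:
        next_layer = []
        for v in layer:
            remaining.discard(v)
            for w in adj[v]:
                if w in remaining:
                    degree[w] -= 1
                    if degree[w] == 1:
                        next_layer.append(w)
        layer = next_layer

    def code(v, parent):
        # Sorting the child codes plays the role of reordering the subtrees.
        kids = sorted(code(w, v) for w in adj[v] if w != parent)
        return labels[v] + ("(" + "".join(kids) + ")" if kids else "")

    centre = sorted(remaining, key=str)
    if len(centre) == 1:
        return code(centre[0], None)
    u, v = centre  # two adjacent central vertices: hang both under a dummy root
    halves = sorted([code(u, v), code(v, u)])
    return DUMMY + "(" + "".join(halves) + ")"


def figure6_molecule():
    """Build the example molecule: a carbon bonded to CH2Cl, CH3, CH3 and OH."""
    labels = {0: "C", 1: "C", 2: "C", 3: "C", 4: "O", 5: "Cl"}
    edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5)]
    next_id = 6
    for heavy_atom, hydrogens in [(1, 2), (2, 3), (3, 3), (4, 1)]:
        for _ in range(hydrogens):
            labels[next_id] = "H"
            edges.append((heavy_atom, next_id))
            next_id += 1
    return labels, edges


print(tree_code(*figure6_molecule()))  # prints C(C(ClHH)C(HHH)C(HHH)O(H))
```

Because the code depends only on labels and on the shape of the tree, renumbering the atoms leaves it unchanged, so two trees can be tested for isomorphism by comparing their codes as strings. The string sorting makes this sketch slower than O(n) on large trees.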
6. Decomposition by Connectivity.
Though the algorithm of Section 5 for encoding trees is simple and fast, most chemical molecules are not trees. However, they are quite sparse and often tree-like. Our approach in this section will be to represent an arbitrary graph as a number of pieces linked in tree-like fashion. We can then encode the graph by encoding each piece separately, using these codes as labels on the linkage tree, and applying the tree encoding algorithm of Section 5 to encode the entire graph. In this way we can make the most out of our tree encoding method; the non-tree-like parts of the graph will usually be small. To decompose a graph, we determine its connectivity. Let G = (V, E) be a connected graph. A cutset of G is a subset
Figure 6. Tree of Figure 5 in canonical form. Dashes enclose subtrees of children of the root.
of vertices S such that there are at least two vertices v and w (not in S) for which every path from v to w passes through a vertex in S. Removal of the vertices in S thus breaks G into two or more connected pieces. If we add the vertices in S to each piece, the resultant subgraphs of G are called the components of G with respect to the cutset S. We concentrate on cutsets containing no more than two vertices. By applying the following procedure, we break G into a number of smaller graphs.
Decomposition algorithm. Begin with a single component consisting of the entire graph. Repeat the following step until it no longer applies: Find a cutset of size one or two in some component. If it is a cutset of size one, subdivide the component into its components with respect to the cutset. If it is a cutset of size two, say {v,w}, subdivide the component into its components with respect to the cutset, and add a new (dummy) edge (v,w) to each new component.
The importance for isomorphism testing of this algorithm is three-fold: first, the components found by the algorithm are essentially unique (preserved under isomorphism). (To guarantee uniqueness we must slightly modify the definition of components with respect to cutsets of size two; see (30,31,32).) Second, the way the components fit together can be represented by a decomposition tree (33). This tree contains one vertex for each component and one vertex for each cutset. A cutset is adjacent to a component in the tree if the vertices of the cutset are in the component. Figure 7 gives an example of a graph, its components, and its decomposition tree.
Third, it is easy to find the components and the decomposition tree. An algorithm for this purpose, which uses depth-first search (a systematic method of exploring a graph), has been developed (34,35,36). It runs in O(n+m) time on an n-vertex, m-edge graph.
Each component with respect to the decomposition is of one of three kinds: a bond (single edge or set of multiple edges), a cycle, or a graph with no multiple edges and no cutsets of size one or two, called a triconnected graph. It is easy to encode bonds and cycles; all that is missing is a method of encoding triconnected graphs. If we can encode all the components, we can use the resultant codes as labels in the decomposition tree and apply the Section 5 algorithm to encode the entire tree. The running time of this algorithm will be O(n+m) for everything except the encoding of the triconnected components. If we use the partitioning method of Section 4 as a basis for encoding triconnected components, the complete algorithm will probably do quite well in practice. However, we have one more improvement to consider.
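The simplest part of the decomposition, finding the cutsets of size one, illustrates how depth-first search does the work in O(n+m) time. The sketch below is a standard lowpoint computation and is not taken from (34,35,36). It assumes a simple graph given as a dictionary of neighbour lists, and the name cut_vertices is chosen for this example. Cutsets of size two need the more elaborate triconnectivity algorithm and are not handled here.

```python
def cut_vertices(adj):
    """Return the set of vertices whose removal disconnects the graph.

    adj: dict mapping each vertex to a list of its neighbours
         (simple undirected graph, every edge listed in both directions).
    """
    number = {}   # order in which depth-first search first reaches a vertex
    low = {}      # smallest number reachable from the subtree via one back edge
    cuts = set()

    def dfs(v, parent):
        number[v] = low[v] = len(number)
        children = 0
        for w in adj[v]:
            if w not in number:            # tree edge: explore the new vertex
                children += 1
                dfs(w, v)
                low[v] = min(low[v], low[w])
                # Nothing below w reaches above v, so v separates w from the rest.
                if parent is not None and low[w] >= number[v]:
                    cuts.add(v)
            elif w != parent:              # back edge to an earlier vertex
                low[v] = min(low[v], number[w])
        # The search root is a cut vertex exactly when it has two or more children.
        if parent is None and children > 1:
            cuts.add(v)

    for v in adj:
        if v not in number:
            dfs(v, None)
    return cuts


# Two triangles sharing vertex 3: removing 3 separates {1, 2} from {4, 5}.
bowtie = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(cut_vertices(bowtie))  # prints {3}
```

The recursion is kept for readability; for graphs with long chains an explicit stack would avoid Python's recursion limit.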
7. Planar Graphs.
A planar graph is a graph which can be drawn on a piece of paper in such a way that no edges cross. Most chemical molecules (with the possible exception of complex organic molecules) are planar (note that this does not mean planar in the sense of stereochemistry). For planar graphs the isomorphism problem also has an easy solution.
When a graph is drawn in the plane, the drawing specifies a circular ordering of the edges around each vertex. A triconnected graph has the property that, if it is planar, its planar representation is unique up to mirror image. Thus there are only two ways of drawing a triconnected planar graph in the plane (two ways of specifying the circular ordering of edges around each vertex). We can use this uniqueness to derive a code for any planar triconnected graph. First, we represent the graph in the plane. This can be done in O(n) time (37). Next, we encode it. One way to do this was suggested by Weinberg (38). We explore the graph in the following way. We pick some starting edge and traverse it from one end to the other. When reaching the other end, we choose the next edge clockwise around the vertex and traverse it. We continue traversing edges in this way. Whenever we reach a vertex reached previously, we back up along the most recently traversed edge and pick the next edge clockwise. We continue the search until we have traversed each edge in both directions and returned to our starting point. Such a search is uniquely determined by the choice of the starting edge and the direction to traverse it.
We can construct a linear code during the search by writing a number (and a label) for each vertex reached, numbering the first vertex one, the next two, and so on. See Figure 8. To get a unique code, we construct a code for each possible edge and direction of traversal, for each of the two planar representations of the graph. Then we choose the lexicographically smallest of all the possible codes. A triconnected planar graph of n ≥ 3 vertices has at most 3n-6 edges (39), so we generate at most 12n-24 codes, each of length O(n), and the total time to get a unique code is O(n^2). This encoding algorithm is very easy to program, but it is possible to get a faster algorithm by using more sophisticated methods. Hopcroft's partitioning algorithm (40) can be used to encode triconnected planar graphs in O(n log n) time (41). Hopcroft and Wong (42) have devised a very complicated algorithm which will encode a triconnected planar graph in O(n) time. More recently, Fontet (43) has devised a simpler O(n)-time encoding algorithm. The practicality of these algorithms has yet to be tested.
Figure 8. (a) Planar graph. (b) Code extracted by search starting with edge (1,2). (Vertices are numbered in search order.) (c) Code extracted by search starting with edge (2,3). (Numbers in brackets give the numbering for this search.) Code (c) is chosen since it is smaller lexicographically. All other codes are identical to either (b) or (c).
(b) 1 2 3 4 1 4 5 6 1 6 2 6 5 3 5 4 3 2 1
(c) 1 2 3 4 1 4 5 1 5 6 2 6 3 6 5 4 3 2 1
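The search described in this section can be sketched as follows. The function traversal_code is one reading of the traversal rules, written for this example and not Weinberg's own program. The table PRISM is likewise an assumption: it is a circular ordering of neighbours for the six-vertex, nine-edge graph that codes (b) and (c) of Figure 8 describe, reconstructed from those codes instead of from the drawing. The sense of rotation was picked so that the Figure 8 codes come out, and it may be the mirror image of the drawing.

```python
# Assumed rotation system: neighbours of each vertex in circular order.
PRISM = {
    1: [6, 4, 2],
    2: [1, 3, 6],
    3: [4, 5, 2],
    4: [1, 5, 3],
    5: [3, 4, 6],
    6: [2, 5, 1],
}


def traversal_code(rotation, start, second):
    """Code obtained by a search that begins by traversing edge (start, second)."""
    number = {start: 1}      # search-order number of each vertex reached
    used = set()             # directed edges already traversed
    code = [1]
    prev, cur = start, second
    while True:
        used.add((prev, cur))
        is_new = cur not in number
        if is_new:
            number[cur] = len(number) + 1
        code.append(number[cur])
        if not is_new and (cur, prev) not in used:
            # Reached an old vertex along a fresh edge: back up along it.
            nxt = prev
        else:
            # Otherwise take the next edge around cur that is still untraversed
            # in the outgoing direction.
            ring = rotation[cur]
            i = ring.index(prev)
            candidates = ring[i + 1:] + ring[:i + 1]
            nxt = next((w for w in candidates if (cur, w) not in used), None)
            if nxt is None:      # every edge has been traversed both ways
                return code
        prev, cur = cur, nxt


def canonical_code(rotation):
    """Smallest code over all starting edges, directions and both mirror images."""
    mirror = {v: ring[::-1] for v, ring in rotation.items()}
    return min(
        traversal_code(r, u, v)
        for r in (rotation, mirror)
        for u in r
        for v in r[u]
    )


CODE_B = [1, 2, 3, 4, 1, 4, 5, 6, 1, 6, 2, 6, 5, 3, 5, 4, 3, 2, 1]
CODE_C = [1, 2, 3, 4, 1, 4, 5, 1, 5, 6, 2, 6, 3, 6, 5, 4, 3, 2, 1]
print(traversal_code(PRISM, 1, 2) == CODE_B)  # prints True
print(traversal_code(PRISM, 2, 3) == CODE_C)  # prints True
```

Each pass through the loop uses a directed edge that has not been used before, so a search takes at most 2m steps. canonical_code runs 4m such searches, which is the O(n^2) bound quoted in the text.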
8. Summary and Other Applications.
We are now in a position to outline a complete isomorphism algorithm. We test isomorphism of two graphs by encoding each graph and testing the codes for equality. To encode a graph, we decompose it by finding all cutsets of size one and two, and forming the corresponding components and decomposition tree. We encode each bond component and each cycle component in some obvious way. We encode each triconnected component as follows. We test the component for planarity. If it is planar, we encode it using one of the methods in Section 7. If it is not planar, we encode it using the partitioning algorithm of Section 4. We use the codes for components as labels in the decomposition tree, and encode the tree (and thus the entire graph) using the method of Section 5. The overall result is a method with a running time of O(n+m) on n-vertex, m-edge graphs, plus whatever time is required to encode non-planar triconnected components.
Though this algorithm has many parts, and programming it is quite a job, it has the potential to be of practical value. Though most of the parts of the algorithm have been programmed individually, the complete algorithm has not been programmed. Hopefully, this situation will be remedied in the near future.
Though the isomorphism problem is a formidable one, we have examined some ideas and some methods which can go a long way toward solving it. Many of the ideas we have considered have applications in other areas of chemistry. For instance, we have discussed representing a sparse graph as an adjacency matrix with many zeros.
We can turn this idea around and use a graph to represent a sparse matrix (the matrix elements become labels for the corresponding graph edges). We can then apply graph-theoretic techniques to matrix problems such as solving a system of linear equations and computing eigenvalues. A large literature has developed in this area; see (44,45,46) for instance.
Another application of graph theory to chemistry is in chromosome analysis. Suppose a chromosome is broken into a number of pieces and each piece analyzed. If this is done a number of times, the pieces found will overlap in various ways. The problem is to use the overlap information to reconstruct the entire chromosome. For linear chromosomes, a linear-time algorithm has been developed to solve this problem (47,48). For chromosomes which are rings, the problem seems surprisingly to be much harder and no good algorithm is known (49).
(1) "Survey of Chemical Notation Systems," National Academy of Sciences, National Research Council Publication 1150, 1964.
(2) Lederberg, J., "Dendral-64, a System for Computer Construction, Enumeration, and Notation of Organic Molecules as Tree Structures and Cyclic Graphs, Part I," NASA Scientific and Technical Aerospace Report, STAR No. N65-13158 and CR 57029, 1964.
(3) Lederberg, J., "Dendral-64, a System for Computer Construction, Enumeration and Notation of Organic Molecules as Tree Structures and Cyclic Graphs, Part II," NASA Scientific and Technical Aerospace Report, STAR No. N66-14074 and CR 68898, 1965.
(4) Sussenguth, E., Jr., J. Chem. Doc. (1965) 5, 36-43.
(5) Harary, F., "Graph Theory," 150-151, Addison-Wesley, Reading, Mass., 1969.
(6) Rivest, R. and Vuillemin, J., Seventh ACM Symp. on Theory of Computing (1975), 6-11.
(7) Hopcroft, J. E. and Tarjan, R. E., Comm. ACM (1973) 16, 372-378.
(8) Tarjan, R., SIAM J. Comput. (1972) 1, 146-160.
(9) Berztiss, A. T., Journal ACM (1973) 20, 365-377.
(10) Corneil, D. G. and Gotlieb, C. C., Journal ACM (1970) 17, 51-64.
(11) Schmidt, D. C. and Druffel, L. E., Journal ACM (1976) 23, 433-445.
(12) Sussenguth, E., Jr., J. Chem. Doc. (1965) 5, 36-43.
(13) Unger, S. H., Comm. ACM (1964) 7, 26-34.
(14) Corneil, D. G. and Gotlieb, C. C., Journal ACM (1970) 17, 51-64.
(15) Schmidt, D. C. and Druffel, L. E., Journal ACM (1976) 23, 433-445.
(16) Hopcroft, J. E., in Kohavi, Z. and Paz, A., eds., "Theory of Machines and Computations," 189-196, Academic Press, New York, 1971.
(17) Sussenguth, E., Jr., J. Chem. Doc. (1965) 5, 36-43.
(18) Cook, S., Third ACM Symp. on Theory of Computing (1971), 151-158.
(19) Karp, R. M., in Miller, R. E. and Thatcher, J. W., eds., "Complexity of Computer Computations," 85-104, Plenum Press, New York, 1972.
(20) Karp, R. M., Networks (1975) 5, 45-68.
(21) Busacker, R. G. and Saaty, T. L., "Finite Graphs and Networks: An Introduction with Applications," 196-199, McGraw-Hill, New York, 1965.
(22) Lederberg, J., NASA Scientific and Technical Aerospace Report, STAR No. N65-13158 and CR 57029, 1964.
(23) Scoins, H. I., Machine Intelligence (1968) 3, 43-60.
(24) Weinberg, L., Proc. Third Annual Allerton Conf. on Circuit and System Theory (1965), 733-744.
(25) Lederberg, J., NASA Scientific and Technical Aerospace Report, STAR No. N65-13158 and CR 57029, 1964.
(26) Harary, F., "Graph Theory," 35-36, Addison-Wesley, Reading, Mass., 1969.
(27) Lederberg, J., NASA Scientific and Technical Aerospace Report, STAR No. N65-13158 and CR 57029, 1964.
(28) Aho, A. V., Hopcroft, J. E., and Ullman, J. D., "The Design and Analysis of Computer Algorithms," 84-86, Addison-Wesley, Reading, Mass., 1974.
(29) Matula, D. W., SIAM Review (1968) 10, 273-274.
(30) Hopcroft, J. E. and Tarjan, R. E., SIAM J. Comput. (1973) 2, 135-158.
(31) Maclaine, S., Duke Math. J. (1937) 3, 460-472.
(32) Tutte, W. T., "Connectivity in Graphs," University of Toronto Press, Toronto, 1966.
(33) Harary, F., "Graph Theory," 36-37, Addison-Wesley, Reading, Mass., 1969.
(34) Hopcroft, J. E. and Tarjan, R. E., SIAM J. Comput. (1973) 2, 135-158.
(35) Hopcroft, J. E. and Tarjan, R. E., Comm. ACM (1973) 16, 372-378.
(36) Tarjan, R. E., SIAM J. Comput. (1972) 1, 146-160.
(37) Hopcroft, J. E. and Tarjan, R. E., Journal ACM (1974) 21, 549-568.
(38) Weinberg, L., IEEE Trans. on Circuit Theory (1966) CT-13, 142-148.
(39) Harary, F., "Graph Theory," 104, Addison-Wesley, Reading, Mass., 1969.
(40) Hopcroft, J. E., in Kohavi, Z. and Paz, A., eds., "Theory of Machines and Computations," 189-196, Academic Press, New York, 1971.
(41) Hopcroft, J. E. and Tarjan, R. E., Journal of Computer and System Sciences (1973) 7, 323-331.
(42) Hopcroft, J. E. and Wong, J. K., Sixth Annual ACM Symp. on Theory of Computing (1973), 172-184.
(43) Fontet, M., Proc. Third International Colloquium on Automata, Languages, and Programming, to appear.
(44) Bunch, J. R. and Rose, D. J., eds., "Sparse Matrix Computations," Academic Press, New York, 1976.
(45) Duff, I. S., "A Survey of Sparse Matrix Research," Technical Report CSS 528, Computer Science and Systems Division, AERE Harwell, 1976.
(46) Rose, D. J. and Willoughby, R., eds., "Sparse Matrices and their Applications," Plenum Press, New York, 1972.
(47) Benzer, S., Proc. of the National Academy of Sciences (1959) 45, 1607-1620.
(48) Lueker, G. S. and Booth, K. S., Seventh ACM Symp. on Theory of Computing (1975), 255-265.
(49) Booth, K. S., "P-Q Trees," Ph.D. Thesis, Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley, 1975.
2
Algorithm Design in Computational Quantum Chemistry
ERNEST R. DAVIDSON
Chemistry Dept., University of Washington, Seattle, WA 98195
Quantum chemistry is a diverse discipline which uses many different methods to correlate a wide variety of phenomena. In the earliest period of the subject the Schrödinger equation was solved exactly for a few simple model situations. These model solutions were then used to interpret the spectra, kinetics, and thermodynamics of molecules and solids. During this period, accurate solutions for the electronic structure of helium (1) and the hydrogen molecule (2) were obtained in order to verify that the Schrödinger equation was useful. Most of the effort, however, was devoted to developing a simple quantum model of electronic structure. Hartree (3) and others developed the self-consistent-field model for the structure of light atoms. For heavier atoms, the Thomas-Fermi model (4) based on total charge density rather than individual orbitals was used.
Models for the electronic structure of polynuclear systems were also developed. Except for metals, where a free electron model of the valence electrons was used, all methods were based on a description of the electronic structure in terms of atomic orbitals. Direct numerical solutions of the Hartree-Fock equations were not feasible and the Thomas-Fermi density model gave ridiculous results. Instead, two different models were introduced. The valence bond formulation (5) followed closely the concepts of chemical bonds between atoms which predated quantum theory (and even the discovery of the electron). In this formulation certain reasonable "configurations" were constructed by drawing bonds between unpaired electrons on different atoms. A mathematical function formed from a sum of products of atomic orbitals was used to represent each configuration. The energy and electronic structure was then
found by the linear variation method (also called "resonance" or "configuration interaction"). Because of its almost one-to-one correspondence with earlier chemical concepts the valence bond model gained widespread acceptance (6). The molecular orbital model (7) assumed, instead, that the electrons were in certain molecular orbitals which could be expressed as linear combinations of atomic orbitals. Configurations were then constructed as various ways of arranging electrons in orbitals. The molecular orbital model gave a clear interpretation of molecular spectra but was less transparent than the valence bond method in modeling geometrical structure of molecules (6,8). In almost all early applications of valence bond (9) and molecular orbital (10) models the integrals encountered were too difficult to actually evaluate so empirical values of the integrals were assumed which reproduced the phenomena being studied.
With the advent of the stored-program digital computer a minor revolution occurred in quantum chemistry. The integrals appearing in the models being used for small molecules were actually evaluated and it became clear that molecules were enormously more complicated than had been anticipated. The oversimplified valence bond and molecular orbital methods often gave qualitatively ridiculous results when taken literally (11). As a consequence of these negative results, the field of ab initio quantum chemistry developed with the goal of finding computer algorithms for solving the Schrödinger equation.
The prospect of obtaining reliable results for molecular systems not susceptible to direct measurement (repulsive potential energy surfaces, upper atmosphere free radicals, etc.) and clarifying the interpretation of experimental results which do not follow simple models attracted interest in this field in spite of the extraordinary expense of the approach and the lack of chemical insight in the early results.
In the ab initio approach the desired answers are the experimental observables - spectral line positions, shapes, intensities; scattering and reaction rates; polarizabilities and optical rotary power; etc. These are to be obtained from the Schrödinger equation by numerical methods which are mathematically well-defined and involve no intermediate parameters not appearing in the Schrödinger equation itself. Usually the Born-Oppenheimer separation of nuclear and electronic coordinates is assumed and small terms in the hamiltonian, such as spin-orbit coupling, are neglected in the first approximation. Perturbation
theory may be used to correct for these approximations by coupling electronic states in the next level of approximation. Figure 1 outlines the relationship between various steps in the calculation of some experimental observables. Central to all other steps is the calculation of the adiabatic electronic wavefunctions for all states of interest. From the wavefunctions one can obtain first order properties and coupling matrix elements for estimating corrections due to coupling of states by non-adiabatic or spin-orbit effects. Methods which by-pass the wavefunction, such as density functional models (12), are not yet sufficiently general to treat this wide class of chemical problems.
Each box in Figure 1 represents its own peculiar computing problems. The algorithms for various steps are at various levels of sophistication depending on the relative cost, difficulty, and interest in the results. The initial calculation of electronic wavefunctions and energy surfaces has preoccupied quantum chemists for thirty years. The calculation of adiabatic scattering and reaction rates has received much attention in recent years (13). The accurate calculation of vibrational-rotational levels is nearly as difficult but has received little attention until very recently. Equally accurate formalisms in the coupled state model do not exist because no general algorithmic formalism exists for handling the electronic part of the problem.
No vibrational-rotational spectrum has yet been computed from an ab initio approach taking full account of Born-Oppenheimer coupling in a Jahn-Teller-Renner situation. Generally speaking the whole area of coupled electronic state calculations lacks a workable algorithm. First order perturbation theory, while suggestive, is often not a quantitative tool.
The rest of this paper will deal exclusively with algorithms for construction of electronic wavefunctions because these are central to the overall problem. In order to appreciate the methods used, one must recall that we are interested in solving a partial differential equation eigenvalue problem for several wavefunctions at several different arrangements of the nuclei. This differential equation involves one- and two-body operators in the potential energy operator and partial derivatives with respect to 3N coordinates (where N is the number of electrons). For benzene, for example, there are 12 nuclei and 42 electrons. The reasonable aspiration of finding the equilibrium geometry and force constants for the first 10 states would involve solving a partial differential
Figure 1. Flow chart for ab initio calculations. Box labels: get approximate potential surfaces and electronic wavefunctions for states and geometries of interest using the Born-Oppenheimer approximation and only Coulomb interactions; natural orbitals; density matrices; chemical interpretation; first order corrections to energy; distribution of spin, charge and momentum; correct wavefunctions for perturbations (spin-orbit, external field, relativistic, etc.) within the Born-Oppenheimer approximation; nuclear motion; adiabatic reaction rates; vibrational averaged properties; polarizability; electronic transition rates and lifetimes; nuclear-electronic coupled motion; coupled state reaction rate; perturbed vibrational-rotational levels, Jahn-Teller-Renner effects.
equation of this type in 126 independent variables. The only reason it is possible here is that (1) the fixed field due to the nuclei dominates over the electron-electron repulsion so the electronic motions are usually not strongly coupled to each other, (2) it is impossible for a large collection of mutually-repulsive particles to avoid each other if they are constrained to remain in the same region of space, and (3) electrons are indistinguishable so the coordinates are permutationally equivalent. Hence the antisymmetric independent particle approximation which leads to a pseudo-separation of variables is often a good first approximation.
Now consider the resources available for solving this (or a similar) problem if some government agency decides these results are vital to the national welfare. It would then be possible to spend up to 10^4 hours of CDC7600 time on this problem (about $10,000,000). This will allow about 10^14 arithmetic operations (addition or multiplication). Also we can assume that at most 10^? words of high speed core memory, 10^? words of low speed core, 10^? words of disk or drum storage, and 10^? words of sequential tape storage are available. By present standards this would be a very large calculation since every number given here is a factor of 10^? larger than what is typically used.
If one wavefunction at one set of nuclear coordinates were sought by numerical integration using only two points in each coordinate, a grid of 2^126 ≈ 10^38 points would be required. If spin and antisymmetry are taken into account the situation is even worse.
Since no two electrons can be at the same point with the same spin, at least \(N\) positions must be considered for each electron and the minimum grid contains \(42! \approx 10^{51}\) points in 3N space.

The only method found so far which is flexible enough to yield ground and excited state wavefunctions, transition rates and other properties is based on expanding all wavefunctions and operators in a finite discrete set of basis functions. That is, a set of one-particle spin-orbitals \(\{\phi_i\}\) is selected and the wavefunction is expanded in Slater determinants based on these orbitals. A direct expansion would require writing \(\Psi\) as

\[ \Psi = \sum_{1 \le i_1 < i_2 < \cdots < i_N \le D} C_{i_1 i_2 \cdots i_N} \, \det(\phi_{i_1}, \phi_{i_2}, \ldots, \phi_{i_N}) . \]
Since the number of possible Slater determinants is \(\binom{D}{N}\), this again gives an exponential dependence on N. For example, the simplest chemically reasonable orbital basis set for benzene has 72 spin orbitals and \(\binom{72}{42} \approx 10^{20}\). Clearly this expansion method is feasible only if very few of the Slater determinants actually contribute to each of the first few wavefunctions. Hence a method is required for constructing the orbitals so that it is known in advance that relatively few of the \(C_{i_1 \cdots i_N}\) will be important.

The standard method for selecting the \(\phi_i\) is to ask for the \(\phi_i\) which maximize the importance of one or more terms in the sum. This gives the self-consistent-field (SCF) or multiconfiguration SCF (MC-SCF) equations. If each \(\phi_i\) is expanded as a linear combination of some fixed set of basis functions \(\{f_j\}_{j=1}^{d}\), the coefficients can be found by an extension of the Roothaan SCF equations. Figure 2 gives an outline of the steps in this approach along with the cost (in machine operations) of each step. For benzene this still requires about \(10^{?}\) operations to form all the integrals required to represent the energy operators in the simplest reasonable basis set (d = 36), \(10^{?}\) operations to find one SCF wavefunction, \(10^{?}\) operations to form the integrals over molecular orbitals and about \(10^{?}\) operations to obtain a good expansion for the wavefunction. If 10 wavefunctions were wanted at 10 nuclear arrangements the total cost would approach \(10^{?}\) operations.
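The three counts quoted in this section (a two-point grid in 126 coordinates, the 42! antisymmetric grid, and the number of 42-electron determinants that can be built from 72 spin orbitals) can be checked with exact integer arithmetic. The following is only an illustrative sketch; the variable names are not from the original paper.

```python
import math

N_ELECTRONS = 42        # electrons in benzene
N_SPIN_ORBITALS = 72    # spin orbitals in the simplest reasonable basis

# Two grid points in each of the 3N = 126 coordinates.
grid_two_points = 2 ** (3 * N_ELECTRONS)

# Antisymmetric grid size quoted in the text, 42!.
grid_antisymmetric = math.factorial(N_ELECTRONS)

# Number of Slater determinants: choose N occupied spin orbitals out of D.
n_determinants = math.comb(N_SPIN_ORBITALS, N_ELECTRONS)

for label, value in [
    ("2^126", grid_two_points),
    ("42!", grid_antisymmetric),
    ("C(72,42)", n_determinants),
]:
    # math.log10 accepts arbitrarily large Python integers.
    print(f"{label:9s} is about 10^{math.log10(value):.1f}")
```

Each of these numbers is far beyond the roughly 10^14 arithmetic operations assumed available above, which is the point of the argument.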
Further, if a good basis set were used, including Rydberg orbitals which are known to be important for some of the lowest excited states, the number of basis functions could easily be quadrupled and the number of arithmetic operations would be very nearly \(10^{?}\). In this example the storage available would present no problem although all of the integrals would not fit into high speed core at one time.

In the following sections of this paper some of the algorithms involved in the various steps shown in Figure 2 are presented in detail. Emphasis is placed on concepts which might be useful outside of quantum chemistry. From the previous discussion it should be clear, however, that ab initio calculations are inherently expensive. Since few research projects can afford to use more than \(10^{?}\) arithmetic operations or words of memory (of all sorts), only relatively small molecules can be treated in detail. For medium size molecules one must be content with SCF calculations at only a few nuclear arrangements. For very large molecules (more than 500 valence electrons) in the absence of symmetry, even the crudest calculation becomes excessively expensive.

[Figure 2. Unit operations in calculating a wavefunction. Steps shown, each with its cost in arithmetic operations: select basis set; form integrals over basis functions; form SCF orbitals; form integrals over molecular orbitals; select configurations by perturbation theory or other rules; form CI matrix (formula and element); find eigenvector and energy; form density and molecular properties.]

Integral Calculation

The integrals involved in typical quantum chemical calculations are of the form (17,18)
\[ B_{ij} = \int f_i^*(\mathbf{r}_1)\, B\, f_j(\mathbf{r}_1)\, d\tau \]

and

\[ Q_{ijkl} = \int f_i^*(\mathbf{r}_1)\, f_j(\mathbf{r}_1)\, Q(r_{12})\, f_k^*(\mathbf{r}_2)\, f_l(\mathbf{r}_2)\, d\tau_1\, d\tau_2 \]

where \(B\) is an operator such as \(\nabla^2\), \(\nabla\), a power of \(r\), or a spherical harmonic \(Y_{lm}\) multiplied by a power of \(r\), and \(Q\) is \(r_{12}^{-1}\) or a similar two-electron operator. The basis functions \(f_i\) must therefore be chosen as a compromise between the best representation of the wavefunction (which requires the fewest \(f_i\) and hence fewest integrals) and the easiest functions to integrate. For atoms, Slater orbitals, \(r^{n} e^{-\zeta r} Y_{lm}(\theta,\phi)\), and numerical orbitals, \(R(r) Y_{lm}(\theta,\phi)\) with \(R\) given numerically, are sufficiently accurate and simple. For diatomics, Slater orbitals have remained the best choice because the integrals can be done with reasonable effort. Polyatomic calculations, however, were blocked for many years because of the difficulty of evaluating electron repulsion (\(r_{12}^{-1}\)) integrals with Slater orbitals. It has been known for some time that gaussian orbitals, \(x^{l} y^{m} z^{n} \exp(-a r^2)\), have certain peculiar properties which make the integrals relatively easy to obtain (14). On the other hand this functional form is not much like the wavefunction of a coulomb potential, so more functions are required.

In recent years a compromise has been found which presently dominates polyatomic calculations. Each function \(f_i\) is expanded as a linear combination of gaussian orbitals (\(f_i\) is then called a contracted gaussian function). Since this is basically a numerical fitting procedure, various choices have been suggested for the contraction scheme. The most popular choices are presently Pople's approximations (15) to Slater orbitals and Dunning's approximations (16) to free atom Hartree-Fock orbitals.
Because they are the most difficult and most numerous of the integrals routinely needed, let us consider the electron repulsion integrals

\[ [ij\|kl] = \int g_i^*(\mathbf{r}_1)\, g_j(\mathbf{r}_1)\, r_{12}^{-1}\, g_k^*(\mathbf{r}_2)\, g_l(\mathbf{r}_2)\, d\tau_1\, d\tau_2 \]

in more detail for the case that all of the \(g\) are simple normalized gaussian "lobes"

\[ g_i(\mathbf{r}) = N_i f_i(\mathbf{r}), \qquad f_i = \exp(-a_i |\mathbf{r}-\mathbf{R}_i|^2), \qquad N_i = (2a_i/\pi)^{3/4} \]

centered at positions \(\mathbf{R}_i\), \(\mathbf{R}_j\), \(\mathbf{R}_k\), \(\mathbf{R}_l\) respectively. This is a "four-center" integral if all the positions are different and is extremely difficult to evaluate using any other type of basis functions. For gaussians, however,

\[ g_i(\mathbf{r}_1)\, g_j(\mathbf{r}_1) = K_{ij}\, f_p(\mathbf{r}_1) \]

where

\[ \mathbf{R}_p = (a_i \mathbf{R}_i + a_j \mathbf{R}_j)/(a_i + a_j), \qquad a_p = a_i + a_j, \qquad f_p = \exp(-a_p |\mathbf{r}-\mathbf{R}_p|^2) \]

and

\[ K_{ij} = N_i N_j \exp(-a_{ij} |\mathbf{R}_i-\mathbf{R}_j|^2), \qquad a_{ij} = a_i a_j/(a_i + a_j), \]

with \(f_q\) and \(K_{kl}\) defined in the same way from \(g_k g_l\), so the integral reduces easily to a two center integral

\[ [ij\|kl] = K_{ij} K_{kl} \int f_p(\mathbf{r}_1)\, r_{12}^{-1}\, f_q(\mathbf{r}_2)\, d\tau_1\, d\tau_2 . \]

This may be further simplified by the change of variables \(\mathbf{r} = \tfrac{1}{2}(\mathbf{r}_1 + \mathbf{r}_2)\), \(\mathbf{r}_{12} = \mathbf{r}_1 - \mathbf{r}_2\) to obtain

\[ f_p(\mathbf{r}_1)\, f_q(\mathbf{r}_2) = f_s(\mathbf{r})\, f_t(\mathbf{r}_{12}) \]
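where

\[ f_s(\mathbf{r}) = \exp(-a_s |\mathbf{r}-\mathbf{R}_s|^2), \qquad f_t(\mathbf{r}_{12}) = \exp(-a_t |\mathbf{r}_{12}-\mathbf{R}_t|^2), \]

\[ a_s = a_p + a_q, \qquad a_t = a_p a_q/(a_p + a_q), \]

and

\[ \mathbf{R}_s = [a_p(\mathbf{R}_p - \tfrac{1}{2}\mathbf{r}_{12}) + a_q(\mathbf{R}_q + \tfrac{1}{2}\mathbf{r}_{12})]/a_s, \qquad \mathbf{R}_t = \mathbf{R}_p - \mathbf{R}_q . \]

Hence

\[ [ij\|kl] = K_{ij} K_{kl} \int d\tau_{12}\, f_t(\mathbf{r}_{12})\, r_{12}^{-1} \int d\tau\, f_s(\mathbf{r}) . \]

The \(\mathbf{r}\) integration can now be done to give

\[ \int d\tau\, f_s(\mathbf{r}) = (\pi/a_s)^{3/2} . \]

Since this is independent of \(\mathbf{R}_s\) (and hence of \(\mathbf{r}_{12}\)), one is left with

\[ [ij\|kl] = K_{ij} K_{kl}\, (\pi/a_s)^{3/2} \int d\tau_{12}\, f_t(\mathbf{r}_{12})\, r_{12}^{-1} . \]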
The angular integration in this final three dimensional integral is easily done if a spherical coordinate system is introduced with the axis chosen along \(\mathbf{R}_t\):

\[ \int d\tau_{12}\, f_t(\mathbf{r}_{12})\, r_{12}^{-1} = 2\pi \int_0^\infty dr_{12}\, r_{12} \int_0^\pi \sin\theta\, d\theta\, \exp(-a_t r_{12}^2)\, \exp(-a_t R_t^2)\, \exp(2 a_t r_{12} R_t \cos\theta) \]

\[ = \frac{\pi}{a_t R_t} \int_0^\infty dr_{12}\, [\exp(-a_t |r_{12}-R_t|^2) - \exp(-a_t |r_{12}+R_t|^2)] = 2\pi\, a_t^{-3/2} R_t^{-1} \int_0^{a_t^{1/2} R_t} dr\, \exp(-r^2) . \]

The remaining integral is closely related to the error function

\[ \mathrm{erf}(t) = 2\pi^{-1/2} \int_0^t dr\, \exp(-r^2) . \]
Because this expression for \([ij\|kl]\) reduces to 0/0 for \(R_t = 0\), it is customary to define a related auxiliary function

\[ F_0(T) = T^{-1/2} \int_0^{T^{1/2}} dr\, \exp(-r^2) \]

so

\[ [ij\|kl] = \frac{2\pi^{5/2} K_{ij} K_{kl}}{a_t\, a_s^{3/2}}\, F_0(T) \qquad \text{if } T = a_t R_t^2 . \]

If the overlap charge

\[ S_{ij} = \int g_i(\mathbf{r})\, g_j(\mathbf{r})\, d\tau = (\pi/a_p)^{3/2} K_{ij} \]

is introduced,

\[ [ij\|kl] = S_{ij} S_{kl}\, a_t^{1/2}\, 2\pi^{-1/2}\, F_0(T) . \]

For large \(T\), \(\mathrm{erf}(\sqrt{T}) \to 1\) and the formula for \([ij\|kl]\) reduces to that for two charges of magnitude \(S_{ij}\) and \(S_{kl}\) interacting at a distance \(R_t\). For small \(T\), \(F_0(T) \to 1\) and \([ij\|kl]\) corresponds to the overlap charges interacting at an average distance of \((\pi/4a_t)^{1/2}\). For all \(T\), \(F_0(T) \le 1\) so

\[ [ij\|kl]/(S_{ij} S_{kl}) \le 2\pi^{-1/2} a_t^{1/2} . \]

Since \(S_{kl} \le 1\) and \(a_t < a_p\),

\[ 0 < [ij\|kl] < 2\pi^{-1/2}\, a_p^{1/2}\, S_{ij} . \]
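As a numerical illustration of the closed form above, the following sketch evaluates the repulsion integral over four normalized gaussian lobes from the overlap charges and \(F_0(T)\). The function names (`boys_f0`, `pair`, `eri`) and the representation of a lobe as an `(exponent, centre)` tuple are assumptions made for this example; `math.erf` stands in for the Chebyshev fits discussed later.

```python
import math

def boys_f0(t: float) -> float:
    """F_0(T) = T^(-1/2) * integral from 0 to sqrt(T) of exp(-r^2) dr."""
    if t < 1e-12:
        return 1.0 - t / 3.0  # leading Taylor terms; avoids 0/0 at T = 0
    return 0.5 * math.sqrt(math.pi / t) * math.erf(math.sqrt(t))

def dist2(u, v):
    """Squared distance between two points given as 3-tuples."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def pair(a_i, r_i, a_j, r_j):
    """Overlap charge S_ij, exponent a_p and centre R_p of the product g_i g_j."""
    a_p = a_i + a_j
    r_p = tuple((a_i * x + a_j * y) / a_p for x, y in zip(r_i, r_j))
    norm = (2.0 * a_i / math.pi) ** 0.75 * (2.0 * a_j / math.pi) ** 0.75
    k_ij = norm * math.exp(-a_i * a_j / a_p * dist2(r_i, r_j))
    return (math.pi / a_p) ** 1.5 * k_ij, a_p, r_p

def eri(gi, gj, gk, gl):
    """[ij||kl] = 2 pi^(-1/2) a_t^(1/2) S_ij S_kl F_0(a_t R_t^2)."""
    s_ij, a_p, r_p = pair(*gi, *gj)
    s_kl, a_q, r_q = pair(*gk, *gl)
    a_t = a_p * a_q / (a_p + a_q)
    t = a_t * dist2(r_p, r_q)
    return 2.0 / math.sqrt(math.pi) * math.sqrt(a_t) * s_ij * s_kl * boys_f0(t)

# Two unit charge distributions 10 length units apart behave like point charges.
near = (1.0, (0.0, 0.0, 0.0))
far = (1.0, (0.0, 0.0, 10.0))
print(eri(near, near, far, far))
```

For two well separated unit charge distributions the result approaches the point-charge value 1/R, in line with the large-T limit described in the text.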
Contracted gaussian lobes (i.e. combinations of only simple gaussians) are frequently used as basis functions (21). For large molecules the lobes may be centered in widely scattered parts of the molecule so that most of the \(S_{ij}\) overlap charges are quite small (\(< 10^{-13}\)). The energy and wavefunction seem to depend only on fixed point accuracy in the integrals [i.e., \(10^{-6}\) absolute (not relative) error in each integral gives about \(10^{-6}\) absolute error in energy]. Hence most integrals do not need to be evaluated for large molecules. Further, many of the integrals can be eliminated by a test based only on one charge distribution. Thus, although \(\sim d^4\) integrals need to be done for small molecules, only \(\sim d^2\) integrals are needed for large molecules.

Those integrals which remain to be done can be written so they involve one exponential, one square root, and either \(F_0(T)\) or \(\mathrm{erf}(\sqrt{T})\). Each of these three functions involves about the same amount of time, although the square root can be made 30% faster than the standard square root routine furnished with the computer software package. Since billions of these basic \([ij\|kl]\) integrals must be evaluated in a typical large calculation, it is essential that the fastest possible algorithm be used. In this regard it is best to evaluate \(F_0(T)\) for small \(T\) and \(\mathrm{erf}(\sqrt{T})\) for large \(T\). By judicious choice of intervals a short Chebyshev series for \(F_0(T)\) or \(\mathrm{erf}(\sqrt{T})\) can be found on each interval (19, 20). Although this involves storing about 4000 coefficients and pointers, the resulting algorithm is nearly twice as fast as one based on larger intervals and longer series or on a Taylor series for short intervals. This division into intervals is simplified by the fact that only \(0 < T < 30\) need be considered since \(\mathrm{erf}(\sqrt{30})\) is one to twelve significant figures.

This analysis is typical of the approach to electron repulsion integrals. Use of cartesian gaussian functions gives rise to a more general basic integral (17)

\[ F_n(T) = \int_0^1 e^{-T u^2}\, u^{2n}\, du . \]

Similarly, Slater orbitals for diatomic molecules give rise to further one-dimensional integrals (22, 23).
Rather elaborate recursion relations can be found for all these integrals when care is taken to preserve numerical accuracy. Since usually all values of \(n\) are needed anyway, the intermediate values of \(n\) as well as the largest value \(n = N\) and the smallest \(n = 0\) are useful. For example, the recursion relation

\[ (2n+1)\, F_n(T) = 2T\, F_{n+1}(T) + e^{-T} \]

is stable for recurring downward on \(n\) but is unstable for recurring upwards (for small \(T/n\)) because \((2n+1) F_n(T) \approx e^{-T}\). Consequently, evaluation of \(F_n(T)\) involves different schemes depending on the value of \(N\) and \(T\). For \(T > N\), upward recurrence from \(F_0\) is possible without loss of significant figures. For \(T < N\), downward recurrence must be used starting from \(F_N(T)\). For most functions this situation would require either a set of tables for every possible starting value of \(N\) or else one table for an \(N^*\) greater than any \(N\) which can occur followed by downward recurrence from \(N^*\). The particular function \(F_n\) dealt with here, however, obeys the relationship

\[ \frac{d}{dT} F_n(T) = -F_{n+1}(T) \]

so the Taylor series has the simple form

\[ F_n(T) = \sum_{k} F_{n+k}(T_0)\, (T_0 - T)^k / k! . \]

The convergence rate of this series is nearly independent of \(n\) (\(F_{n+k+1}/F_{n+k} \approx 1\) for small \(T\)) so a table of \(F_n(T)\) at a sequence of intervals of \(T\) for \(n\) from zero to \(N^* + K\) (an interval width of 0.1 requires a \(K\) of 6 for twelve significant figures) suffices for all values of \(n\) and \(T\). As for \(F_0\), at large \(T\) it is better to evaluate a generalized error function

\[ G_n(\sqrt{T}) = 2\pi^{-1/2} \int_0^{\sqrt{T}} e^{-u^2}\, u^{2n}\, du, \qquad G_{n+1}(\sqrt{T}) = (n+\tfrac{1}{2})\, G_n(\sqrt{T}) - \pi^{-1/2}\, T^{n+1/2}\, e^{-T} . \]
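The downward recurrence can be sketched in a few lines. The example below is not the tabulated Taylor or Chebyshev scheme of the text: as a simplifying assumption it starts from the highest index using the convergent series obtained by iterating the recursion relation, \(F_n(T) = e^{-T} \sum_k (2T)^k / [(2n+1)(2n+3)\cdots(2n+2k+1)]\), and then recurs downward. The function name `boys_table` is hypothetical.

```python
import math

def boys_table(n_max: int, t: float) -> list:
    """Return [F_0(T), ..., F_n_max(T)] using downward recurrence.

    F_n_max is summed from the series generated by repeatedly applying
    (2n+1) F_n = 2T F_{n+1} + exp(-T); lower n follow from the same relation.
    """
    # Series for the top index: every term is positive, so no cancellation.
    term = 1.0 / (2 * n_max + 1)
    total = term
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= 2.0 * t / (2 * n_max + 2 * k + 1)
        total += term
    exp_mt = math.exp(-t)
    f = [0.0] * (n_max + 1)
    f[n_max] = exp_mt * total
    # Downward recurrence on n.
    for n in range(n_max - 1, -1, -1):
        f[n] = (2.0 * t * f[n + 1] + exp_mt) / (2 * n + 1)
    return f

values = boys_table(6, 5.0)
print(values[0], 0.5 * math.sqrt(math.pi / 5.0) * math.erf(math.sqrt(5.0)))
```

Comparing the recurred \(F_0(T)\) with the closed form in terms of erf gives a quick consistency check on the whole table.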
Hence an efficient algorithm must recognize several ranges of \(T\):

    T = 0                      F_n = (2n+1)^{-1}
    0 < T < N                  F_N by Taylor series, recur down
    N < T < T*                 F_0 by Chebyshev series, recur up
    max(N, T*) < T < T**       G_0 by Chebyshev series, recur up
    T** < T < T***             G_0 = 1, recur up
    T*** < T                   G_0 = 1, G_{n+1} = (n + 1/2) G_n

where \(T^* \approx 7\), \(T^{**} \approx 30\), \(T^{***} \approx 30 + 3N\) if 13 figure accuracy is wanted.

Self-Consistent-Field

The simplest approximate wavefunction for an open-shell molecule is the spin-unrestricted Hartree-Fock function

\[ \Psi = (N!)^{-1/2} \det\{\phi_1 \phi_2 \cdots \phi_N\} \]

where \(N\) is the number of electrons and

\[ \phi_i = \sum_{j=1}^{d} c_{ji} f_j \]

are orthonormal spin-orbitals. The expectation value of the energy, \(\langle \Psi | H | \Psi \rangle\), is a quartic polynomial \(E(c)\) in the \(Nd\) variables \(c\). The orthonormality constraints form a set of subsidiary quadratic constraints of the form

\[ G_i(c) = 0, \qquad i = 1 \ldots L . \]

The self-consistent-field algorithm is an iterative method for finding the coefficients \(c\) which minimize \(E(c)\) subject to these constraints.
This algorithm may be derived from the Euler-Lagrange equations

\[ \partial E/\partial c_{ij} = \sum_k \lambda_k\, \partial G_k/\partial c_{ij} \]

which are cubic in \(c\). The wavefunction is unchanged by a unitary transformation among the spin-up or spin-down orbitals. Roothaan (24) has shown how this arbitrariness may be used to change the Euler-Lagrange equations to the pseudo-eigenvalue form

\[ F(c)\, c_k = \epsilon_k\, S\, c_k \]

where \(F\) is a quadratic polynomial in the \(c\) coefficients (which is still somewhat arbitrary). Since this cubic equation cannot be solved explicitly, one can attempt an iterative solution in the form

\[ F(c^{(n-1)})\, c_k^{(n)} = \epsilon_k^{(n)}\, S\, c_k^{(n)} . \]

Although this equation is usually stated as the basis of the iterative algorithm, it often does not lead to rapid convergence (25). Consequently the \(F\) matrix is usually modified in four different ways:

(1) The arbitrariness (26) in the definition of \(F\) is used to ensure that the correction \(\delta c\) to \(c\) agrees with the Newton-Raphson solution of the Euler-Lagrange equations to first order in \(\delta c\).
(2) The elements of \(F\) are extrapolated (27) from \(F(c^{(n-3)})\), \(F(c^{(n-2)})\) and \(F(c^{(n-1)})\), assuming each element converges geometrically, to give \(F^{(n-1)}\).
(3) Oscillations are damped by averaging (27), with appropriate weights, \(F(c^{(n-1)})\) and \(F(c^{(n-2)})\) to give \(F^{(n-1)}\).
(4) Oscillations are damped by adding (26) to \(F(c^{(n-1)})\) a root-shift term of the form \(\sum_j a_j\, c_j^{(n-1)} c_j^{(n-1)\dagger}\) to obtain \(F^{(n-1)}\).

The last three of these modifications have the property that \(F^{(n)}\) converges to \(F(c)\) as \(c^{(n)}\) converges to \(c\), so at convergence the cubic equation is solved.

These methods for controlling convergence of an iterative solution to a complicated set of equations have wide applicability. The extrapolation and damping methods are based on well-known ideas for single variables, while root-shifting may be a novel development by quantum chemists.

Spin-restricted and multi-configuration self-consistent-field methods differ in the assumed functional form for \(\Psi\). The basic method for solving the resulting cubic Euler-Lagrange equations remains similar to that just discussed.

Configuration Interaction

Configuration interaction has come to mean any expansion of the wavefunction in a finite series of N-electron functions (28)

\[ \Psi = \sum_I C_I\, \Phi_I(\mathbf{r}_1, \ldots, \mathbf{r}_N) \]

where the \(C_I\) satisfy the matrix eigenvalue equations

\[ H C = E\, S\, C, \qquad H_{IJ} = \langle \Phi_I | H | \Phi_J \rangle, \qquad S_{IJ} = \langle \Phi_I | \Phi_J \rangle . \]

Most CI calculations involve configurations formed from a common set of orthonormal orbitals by spin and symmetry adaptation of Slater determinants. In this case \(S\) is a unit matrix and the formation of \(H\) is greatly simplified.

In most CI calculations the \(H_{IJ}\) are first expressed in terms of basic integrals \(I_k\) over orthonormal molecular orbitals as
\[ H_{IJ} = \sum_k \Gamma_k^{IJ}\, I_k \]

where the \(\Gamma_k^{IJ}\) are integral independent coefficients which constitute a "formula" for \(H_{IJ}\). Generating the \(\Gamma_k^{IJ}\) is still the most time consuming part of the formation of \(H\). Since the \(\Gamma^{IJ}\) are dependent only on the indices of the orbitals involved in \(I\) and \(J\), they may be used for several arrangements of the molecular nuclei (as long as the labels involved in each configuration remain unchanged).

If \(\Psi\) is predominantly one Slater determinant, the coefficients \(C\) may be found by many-body-perturbation theory (29). This theory provides an elegant scheme for simplifying the perturbation formulas by combining terms referring to the same \(I_k\) integrals.
In the more general case, \(\Psi\) involves several Slater determinants with large coefficients and corresponds to an excited state. In this case no simplified theory is possible and \(H\) must be constructed.

The first step in constructing \(H\) is producing the list of configurations to be included. At a moderate level of accuracy only the SCF configuration and other configurations nearly degenerate with it need be considered. For higher accuracy more configurations are needed. These configurations may be classified as singly, doubly, triply, ... excited depending on the least number of excitations required to form the configuration from one of the dominant ones. For fixed relative error in the excitation energy of a hydrocarbon molecule the number of spin orbitals increases in proportion to the number of electrons, \(N\). The number of \(k\)-fold excitations from any one Slater determinant is then proportional to \(N^{2k}\). If all configurations are used to all excitation levels there are \(\sim N^4\) non-zero entries in each row of \(H\) and about \(A^N\) rows (where \(A\) is a fixed number for fixed relative error and is about 10 for a double zeta basis set).

As noted before, such a large rate of growth with \(N\) cannot be tolerated. Consequently most CI calculations are run with limited excitation levels (typically only single and double excitations). It is easily demonstrated, however, that this procedure leads to increasing error as the number of electrons increases. In fact, for tightly localized electron pairs, the dominant excitation level is the value of \(k\) nearest \(0.01N\) (i.e., for about 200 electrons the double excitations in aggregate are more important than the SCF configuration and for 400 electrons quadruple excitations should dominate). Even for molecules with only 40 electrons, quadruple and higher excitations must be considered in order to reproduce excitation energies (30) or potential surfaces to an accuracy of 0.1 eV. Thus, configuration interaction calculations for very large molecules are hopeless unless perturbation theory can be used to correct for unlinked cluster effects.

For this reason, modern CI calculations are really limited to high accuracy calculations on small molecules. With this limitation both excited and ground states may be treated with uniform accuracy provided the same procedure is followed for each state. This requires a separate SCF calculation, integral transformation, and CI calculation for each desired state.
Because of the large number of configurations which can be constructed even with just double excitations, some attention must be paid to limiting the number which are important. This can be done by constructing molecular orbitals which maximize the convergence rate of the CI series. Ordinary SCF orbitals offer a reasonable starting set of occupied orbitals (although localized orbitals may be better). The SCF virtual orbitals can be improved, however, by use of approximate natural orbitals (31). These orbitals are distinguished by the fact that they are largest in the regions where the wavefunction error is largest. In terms of such localized corrections only a few double excitations from each of the electron pairs are required for reasonable accuracy.

The actual algorithm for evaluating \(H_{IJ}\) varies greatly between different research groups. The crudest, but most general, approach is to assume each configuration is formed as a short sum of Slater determinants

\[ \Phi_I = \sum_\nu a_\nu^{I}\, \det(\phi_{\nu 1}, \phi_{\nu 2}, \ldots, \phi_{\nu N}) \]

which produces a spin-eigenfunction from orthonormal spin-restricted spin-orbitals (i.e., the spin-up and spin-down spin-orbitals occur in pairs which differ only in spin). Then \(H_{IJ}\) is zero if all of the Slater determinants in \(I\) differ by at least three substitutions from all of the determinants in \(J\). Since most matrix elements are zero, a rapid test for this condition is essential. Usually a configuration is specified by the list of space-orbitals (spin-independent) which occur in every Slater determinant in the configuration. These space orbital occupations are specified by two binary words, where each bit is on or off in one word depending on whether the corresponding orbital is singly occupied or not, and on or off in the other word depending on whether the corresponding orbital is doubly occupied. Boolean arithmetic on these words can easily produce a word which indicates which occupations have changed, and the bit count of this word can give the number of changes. For those \(H_{IJ}\) which have to be evaluated, there are different formulae depending on the number of orbitals by which \(I\) and \(J\) differ (28,32).
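The occupation test described above can be sketched with ordinary integer bit operations. This is one possible encoding, not necessarily the one used in any particular CI program: bit n of the first word is set when space orbital n is singly occupied, bit n of the second word when it is doubly occupied, and the function names are hypothetical.

```python
def bit_count(word: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(word).count("1")

def substitutions(single_i: int, double_i: int, single_j: int, double_j: int) -> int:
    """Least number of electron substitutions between two space-orbital occupations."""
    changed_single = single_i ^ single_j   # single-occupancy bit flipped
    changed_double = double_i ^ double_j   # double-occupancy bit flipped
    # Occupation changes 0<->1 and 1<->2 flip the singles bit and move one electron;
    # a change 0<->2 flips only the doubles bit and moves two electrons.
    moved = bit_count(changed_single) + 2 * bit_count(changed_double & ~changed_single)
    return moved // 2   # each substitution removes one electron and adds one

def may_interact(conf_i, conf_j) -> bool:
    """False when H_IJ must vanish because I and J differ by three or more substitutions."""
    return substitutions(*conf_i, *conf_j) <= 2

# Configurations as (singles word, doubles word); orbital 0 is the lowest bit.
ground = (0b000000, 0b000111)      # orbitals 0, 1, 2 doubly occupied
double_exc = (0b000000, 0b001011)  # the pair in orbital 2 moved to orbital 3
quad_exc = (0b000000, 0b110001)    # the pairs in orbitals 1 and 2 moved to 4 and 5
print(may_interact(ground, double_exc), may_interact(ground, quad_exc))
```

Only a few exclusive-or, and, and bit-count operations are needed per pair of configurations, which is why the test is cheap enough to apply to every candidate matrix element.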

Matrix Manipulations

Storage. One of the more serious computational problems in quantum chemistry is the storage, manipulation, and retrieval of large arrays of real numbers. If some care is not taken, a calculation may be needlessly limited by the storage capacity of central memory, disks, or tapes.

The largest arrays which occur in calculations are of two types. One arises from the electron repulsion integrals and grows in size like the fourth power of the number of basis functions. The other is the configuration interaction hamiltonian matrix, which grows like the square of the number of configurations. Many other smaller arrays, whose size is proportional to the square of the number of basis functions, occur throughout the calculation.

For non-symmetric matrices of dimension \(n \times m\) with few zero entries the most efficient storage pattern is rectangular (hereafter referred to as R) with the location of the \(i,j\) element computed as \(L = i + n(j-1)\). For real symmetric matrices of dimension \(n\), a triangular pattern (referred to as T) is used with the location of \(i,j\) computed as \(L = i + j(j-1)/2\) for \(i \le j\) [or \(L = j + i(i-1)/2\) for \(i > j\)].

The CI hamiltonian matrix is a large real symmetric matrix with mostly zero entries (provided orthonormal configurations constructed from orthonormal orbitals are used). If more than half the entries are zero it is more efficient to omit zero entries and include the index as a label (if the word length is long and the matrix is small enough, this label may be packed into the insignificant bits of the matrix element).
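The R and T location formulas, and the labelled storage of non-zero entries, can be written down directly. The sketch below keeps the 1-based locations of the text; the function names are illustrative only, and the label is stored next to the value rather than packed into its low-order bits.

```python
def rect_index(i: int, j: int, n: int) -> int:
    """Location of element (i, j) of an n x m matrix stored by columns (pattern R)."""
    return i + n * (j - 1)

def tri_index(i: int, j: int) -> int:
    """Location of element (i, j) of a real symmetric matrix (pattern T)."""
    if i > j:
        i, j = j, i          # use symmetry so that i <= j
    return i + j * (j - 1) // 2

def pack_sparse(packed, threshold: float = 0.0):
    """Keep only entries larger in magnitude than threshold, as (location, value) pairs."""
    return [(loc, v) for loc, v in enumerate(packed, start=1) if abs(v) > threshold]

# The upper triangle of a 4 x 4 symmetric matrix fills locations 1..10 exactly once.
locations = sorted(tri_index(i, j) for j in range(1, 5) for i in range(1, j + 1))
print(locations)
print(pack_sparse([0.0, 1.5, 0.0, -2.0]))
```

Labelled storage pays off once more than about half of the entries are zero, since each retained entry then costs a value plus a label.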
The electron repulsion integrals are more complicated to store if point group symmetry is used to reduce their number. In general the integrals may be classified into blocks depending on the symmetry of the four orbitals involved in the integral \([i_1 i_2 \| i_3 i_4]\). Integrals from the block labeled with symmetries \(\Gamma_1, \Gamma_2, \Gamma_3, \Gamma_4\) can be stored in six different patterns: RRR, RTR, TTR, RTT, RRT, and TTT, where the first letter tells whether a rectangular (\(\Gamma_1 \ne \Gamma_2\)) or triangular (\(\Gamma_1 = \Gamma_2\)) pattern is used to compute the first charge distribution location \(L_{12}\), the second letter indicates whether a rectangular (\(\Gamma_3 \ne \Gamma_4\)) or triangular (\(\Gamma_3 = \Gamma_4\)) pattern is used to compute the second charge distribution location \(L_{34}\), and the final letter indicates whether a rectangular (\(\Gamma_1 \ne \Gamma_3\) or \(\Gamma_2 \ne \Gamma_4\)) or triangular (\(\Gamma_1 = \Gamma_3\) and \(\Gamma_2 = \Gamma_4\)) pattern is used to compute the integral location \(L_{1234}\). Zero blocks are omitted, of course, and it is sufficient to consider \(\Gamma_1 \ge \Gamma_2\), \(\Gamma_3 \ge \Gamma_4\), and \(\Gamma_1(\Gamma_1-1)/2 + \Gamma_2 \ge \Gamma_3(\Gamma_3-1)/2 + \Gamma_4\). Non-zero integrals over symmetry orbitals or molecular orbitals are usually not small, so no further simplification is possible. Non-zero integrals over atomic basis functions may be quite small, however, and large numbers of these can be omitted if labels are retained.
Transformations. A frequently occurring step in calculations is a change of basis via a linear transformation. That is, a new set of basis functions (such as molecular orbitals, group orbitals, natural orbitals, etc.) are defined as linear combinations of the original atomic orbitals, by

\[ g_i(\mathbf{r}) = \sum_{j=1}^{d} w_{ji}\, f_j(\mathbf{r}), \qquad i = 1 \ldots d' \le d . \]

Matrix elements of one-body hermitian operators (such as kinetic energy, nuclear attraction, the Fock operator, etc.) have the form

\[ B_{ij} = \int f_i(\mathbf{r})^*\, B\, f_j(\mathbf{r})\, d\tau \]

in terms of the original basis functions. The new matrix elements

\[ \bar{B}_{ij} = \int g_i(\mathbf{r})^*\, B\, g_j(\mathbf{r})\, d\tau \]

are easily computed from the \(B_{ij}\) by the matrix transformation \(\bar{B} = W^{\dagger} B W\). If symmetry is considered, one may also encounter unsymmetrical blocks of the matrix defined by

\[ B_{ij}^{(\Gamma_1,\Gamma_2)} = \int f_{i,\Gamma_1}(\mathbf{r})^*\, B\, f_{j,\Gamma_2}(\mathbf{r})\, d\tau \]

where \(f_{i,\Gamma}\) is the \(i\)th function in symmetry block \(\Gamma\). In this case there will be a different \(W^{(\Gamma)}\) matrix for each symmetry block and one must compute all nonvanishing matrices of the form
\[ \bar{B}^{(\Gamma_1,\Gamma_2)} = W^{(\Gamma_1)\dagger}\, B^{(\Gamma_1,\Gamma_2)}\, W^{(\Gamma_2)} . \]

Thus, generally, two matrix transformation algorithms are required, one for \(B\) stored triangularly (\(\Gamma_1 = \Gamma_2\)) and one for \(B\) stored rectangularly (\(\Gamma_1 \ne \Gamma_2\)). The transformation could be written as a double sum

\[ \bar{B}_{ij} = \sum_{k=1}^{d_1} \sum_{l=1}^{d_2} w^{(\Gamma_1)*}_{ki}\, B_{kl}\, w^{(\Gamma_2)}_{lj}, \qquad i = 1 \ldots d'_1, \quad j = 1 \ldots d'_2 . \]

Direct evaluation in this form requires \(d_1 d_2 d'_1 d'_2\) multiplications. On the other hand the multiplication \(Y = B W^{(\Gamma_2)}\) followed by \(W^{(\Gamma_1)\dagger} Y\) requires only \(d_1 d_2 d'_2 + d_1 d'_1 d'_2\) multiplications (or \(d_1 d_2 d'_1 + d_2 d'_1 d'_2\) if the multiplications are done in the opposite order).

Figures 3 and 4 show an outline of algorithms for the triangular and rectangular cases for matrices small enough to fit entirely into high speed core. These algorithms are designed with one additional principle in mind. Namely, the only real variation between different ways of doing matrix multiplication is the cost of indexing and amount of scratch storage used. Double subscripts should usually be avoided and as far as possible matrix elements should be accessed sequentially. For this reason it is best to carry out the rectangular transformation as \(Y^{\dagger} = B^{\dagger} W^{(\Gamma_1)}\) followed by \(\bar{B} = Y W^{(\Gamma_2)}\). Scratch storage is reduced by using each column of \(Y^{\dagger}\) as soon as it is formed to do the second multiplication. The triangular transformation is further complicated by the fact that both \(B\) and \(\bar{B}\) are stored in a triangular pattern, which increases the complexity of indexing.

Transformation of the two electron integrals is a much more time consuming step. If \(R(i_1,i_2,i_3,i_4)\) is the integral

\[ R(i_1,i_2,i_3,i_4) = \int f_{i_1}(\mathbf{r}_1)^*\, f_{i_2}(\mathbf{r}_1)\, r_{12}^{-1}\, f_{i_3}(\mathbf{r}_2)^*\, f_{i_4}(\mathbf{r}_2)\, d\tau_1\, d\tau_2 \]

and \(\bar{R}\) is the transformed integral

\[ \bar{R}(j_1,j_2,j_3,j_4) = \int g_{j_1}(\mathbf{r}_1)^*\, g_{j_2}(\mathbf{r}_1)\, r_{12}^{-1}\, g_{j_3}(\mathbf{r}_2)^*\, g_{j_4}(\mathbf{r}_2)\, d\tau_1\, d\tau_2 , \]
42
    INC = $\bar{d}_1 \bar{d}_2$
    ICNT = 0
    For i = 1...$\bar{d}_1$
        LCNT = 0
        For l = 1...$d_2$
            D(l) = $\sum_{k=1}^{d_1}$ A(k+ICNT)*B(k+LCNT)
            LCNT = LCNT + $d_1$
        JCNT = 0
        For j = i...INC in steps of $\bar{d}_1$
            X(j) = $\sum_{l=1}^{d_2}$ D(l)*C(l+JCNT)
            JCNT = JCNT + $d_2$
        ICNT = ICNT + $d_1$

Figure 3. Transformation of a real non-symmetric matrix, $X = A^{T} B C$
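The outline in Figure 3 can be sketched in Python as follows. This is a minimal illustration, not the author's program: it assumes all three matrices are held as flat lists in column-major order, and the function names and the 0-based indexing are choices made here. `transform_direct` evaluates the double sum term by term and is included only as a check.

```python
def transform_rect(A, B, C, d1, d2, n1, n2):
    """Return X = A^T B C for flat, column-major lists.

    A is d1 x n1, B is d1 x d2, C is d2 x n2, and X is n1 x n2.
    """
    X = [0.0] * (n1 * n2)
    D = [0.0] * d2                     # scratch: row i of A^T B
    for i in range(n1):
        icnt = i * d1                  # start of column i of A
        for l in range(d2):
            lcnt = l * d1              # start of column l of B
            D[l] = sum(A[k + icnt] * B[k + lcnt] for k in range(d1))
        for j in range(n2):
            jcnt = j * d2              # start of column j of C
            X[i + j * n1] = sum(D[l] * C[l + jcnt] for l in range(d2))
    return X


def transform_direct(A, B, C, d1, d2, n1, n2):
    """Same result from the double sum, for checking only."""
    X = [0.0] * (n1 * n2)
    for i in range(n1):
        for j in range(n2):
            X[i + j * n1] = sum(A[k + i * d1] * B[k + l * d1] * C[l + j * d2]
                                for k in range(d1) for l in range(d2))
    return X
```

Only the scratch vector `D` of length $d_2$ is needed besides the input and output arrays, and every inner loop walks through consecutive storage locations.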
    ITRI = 0
    ICNT = 0
    For i = 1...$\bar{d}$
        LTRI = 0
        CLEAR D(l) TO ZERO FOR l = 1...d
        For l = 1...d
            JM = l-1
            For k = 1...JM
                D(l) = D(l) + C(k+ICNT)*B(k+LTRI)
                D(k) = D(k) + C(l+ICNT)*B(k+LTRI)
            LTRI = LTRI+l
            D(l) = D(l) + C(l+ICNT)*B(LTRI)
        JCNT = 0
        ITR = ITRI+i-1
        For j = ITRI...ITR
            X(j+1) = $\sum_{l=1}^{d}$ D(l)*C(l+JCNT)
            JCNT = JCNT+d
        ICNT = ICNT+d
        ITRI = ITRI+i

Figure 4. Transformation of a real symmetric matrix, $X = C^{T} B C$
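A Python sketch of the triangular case in Figure 4 is given below. It assumes that the symmetric matrix is packed as its lower triangle, row by row, and that $C$ is a flat column-major list. Both conventions, the function name, and the 0-based indexing are assumptions made for this illustration.

```python
def transform_tri(B, C, d, n):
    """Return packed X = C^T B C for a real symmetric B.

    B holds the lower triangle row by row (d*(d+1)//2 numbers), C is
    d x n stored column-major, and X is returned packed like B.
    """
    X = [0.0] * (n * (n + 1) // 2)
    itri = 0                               # offset of row i in packed X
    for i in range(n):
        icnt = i * d                       # start of column i of C
        D = [0.0] * d                      # scratch: D = B c_i
        ltri = 0                           # offset of row l in packed B
        for l in range(d):
            for k in range(l):
                b = B[ltri + k]            # B[l][k], also used as B[k][l]
                D[l] += C[k + icnt] * b
                D[k] += C[l + icnt] * b
            D[l] += C[l + icnt] * B[ltri + l]   # diagonal element
            ltri += l + 1
        for j in range(i + 1):             # only j <= i is stored
            X[itri + j] = sum(D[l] * C[l + j * d] for l in range(d))
        itri += i + 1
    return X
```

Each off-diagonal element of `B` is read once and used twice, which is what makes the packed storage workable at the cost of the extra index bookkeeping mentioned in the text.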
the $\bar{R}$ and $R$ are related by a four-index (tensor) transformation,

$$\bar{R}(j_1,j_2,j_3,j_4) = \sum_{i_1 i_2 i_3 i_4} W^{(\Gamma_1)*}_{i_1 j_1}\, W^{(\Gamma_2)}_{i_2 j_2}\, W^{(\Gamma_3)*}_{i_3 j_3}\, W^{(\Gamma_4)}_{i_4 j_4}\, R(i_1,i_2,i_3,i_4).$$

Direct evaluation of this four-fold sum would require $4\, d_1 \bar{d}_1 d_2 \bar{d}_2 d_3 \bar{d}_3 d_4 \bar{d}_4$ multiplications to form a symmetry block of $\bar{R}$ integrals. By contrast, sequential one-index transformations

$$X(j_1,i_2,i_3,i_4) = \sum_{i_1} W^{(\Gamma_1)*}_{i_1 j_1}\, R(i_1,i_2,i_3,i_4)$$

$$Y(j_1,j_2,i_3,i_4) = \sum_{i_2} W^{(\Gamma_2)}_{i_2 j_2}\, X(j_1,i_2,i_3,i_4)$$

$$Z(j_1,j_2,j_3,i_4) = \sum_{i_3} W^{(\Gamma_3)*}_{i_3 j_3}\, Y(j_1,j_2,i_3,i_4)$$

$$\bar{R}(j_1,j_2,j_3,j_4) = \sum_{i_4} W^{(\Gamma_4)}_{i_4 j_4}\, Z(j_1,j_2,j_3,i_4)$$

require only $\bar{d}_1 d_1 d_2 d_3 d_4 + \bar{d}_1 \bar{d}_2 d_2 d_3 d_4 + \bar{d}_1 \bar{d}_2 \bar{d}_3 d_3 d_4 + \bar{d}_1 \bar{d}_2 \bar{d}_3 \bar{d}_4 d_4$ multiplications. These transformations can be organized by thinking of $R(i_1,i_2,i_3,i_4)$ for fixed $i_3 i_4$ as a matrix $R^{(i_3 i_4)}$ which is transformed like a one-body operator to give

$$Y^{(i_3 i_4)} = W^{(\Gamma_1)\dagger}\, R^{(i_3 i_4)}\, W^{(\Gamma_2)}.$$

If the $Y$ matrices are then reorganized to give $Y^{(j_1 j_2)}$ matrices by use of

$$Y^{(j_1 j_2)}_{i_3 i_4} = Y^{(i_3 i_4)}_{j_1 j_2},$$

the $\bar{R}$ integrals can be formed from

$$\bar{R}^{(j_1 j_2)} = W^{(\Gamma_3)\dagger}\, Y^{(j_1 j_2)}\, W^{(\Gamma_4)}.$$
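The saving from sequential one-index transformations can be illustrated with a small Python sketch. It assumes real $W$ matrices (so complex conjugation is ignored) and stores the four-index arrays as dictionaries keyed by index tuples. The function names are hypothetical, and the approach is meant for checking small cases rather than for production use.

```python
from itertools import product


def rotate_transform(T, dims, W):
    """Contract the first index of T with W and move the new index last.

    T maps index tuples to numbers, dims is a tuple of index ranges,
    and W is a real d x n matrix stored as a list of rows.
    """
    d, rest = dims[0], dims[1:]
    n = len(W[0])
    out = {}
    for tail in product(*(range(m) for m in rest)):
        for j in range(n):
            out[tail + (j,)] = sum(W[i][j] * T[(i,) + tail] for i in range(d))
    return out, rest + (n,)


def four_index_sequential(R, dims, Ws):
    """Four one-index steps; after four rotations the order is (j1, j2, j3, j4)."""
    T = R
    for W in Ws:
        T, dims = rotate_transform(T, dims, W)
    return T


def four_index_direct(R, dims, Ws):
    """The four-fold sum evaluated term by term (for checking only)."""
    ns = [len(W[0]) for W in Ws]
    out = {}
    for j in product(*(range(n) for n in ns)):
        total = 0
        for i in product(*(range(d) for d in dims)):
            term = R[i]
            for W, a, b in zip(Ws, i, j):
                term *= W[a][b]
            total += term
        out[j] = total
    return out


def multiplication_counts(d, n):
    """Return (direct, sequential) multiplication counts as given in the text."""
    d1, d2, d3, d4 = d
    n1, n2, n3, n4 = n
    direct = 4 * d1 * n1 * d2 * n2 * d3 * n3 * d4 * n4
    sequential = (n1 * d1 * d2 * d3 * d4 + n1 * n2 * d2 * d3 * d4
                  + n1 * n2 * n3 * d3 * d4 + n1 * n2 * n3 * n4 * d4)
    return direct, sequential
```

For example, `multiplication_counts((10,) * 4, (10,) * 4)` gives 400000000 multiplications for the direct sum against 400000 for the sequential scheme.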
The use of six different storage patterns for the two-electron integrals requires six different algorithms for carrying out the transformation. Only the simplest (RRR) will be presented in detail here (33). Since the number of integrals usually exceeds the amount of high-speed core available (and usually low-speed core as well), a transformation using minimum core will be discussed (assuming disk is large enough to hold one block of $R$). Suppose the integrals $R^{(i_3 i_4)}$ are originally arranged so that $R^{(1,1)}, R^{(2,1)}, \dots$ appear in sequential order on a sequential file. The range of $(i_3 i_4)$ can be blocked into $d_3 d_4/n_{34}$ groups of size $n_{34}$ (with a smaller group at the end if needed). Each group of $n_{34}$ $R$ matrices can then be transformed by a standard two-subscript transformation to leave $n_{34}$ $Y$ matrices in sequential order (in the same space in core originally occupied by the $R$ matrices). Storage for the $W$ matrices and one scratch region for $W^{(\Gamma_1)\dagger} R^{(i_3 i_4)}$ are needed in addition to the space for the $R$ arrays. The $j_1 j_2$ subscripts on each $Y^{(i_3 i_4)}$ array can also be blocked into $\bar{d}_1 \bar{d}_2/n_{12}$ blocks of size $n_{12}$, and the $Y$ arrays can be written to disk in blocks of size $n_{12} n_{34}$ by use of a random access file. When all $R$ matrices have been transformed, a block of $Y^{(j_1 j_2)}$ matrices is easily formed in core by reading all appropriate pieces from disk. The $Y$ arrays can then be transformed by a standard two-subscript transformation and written to a sequential file. This method requires $d_1 d_2 n_{34}$ words of high-speed core for the initial $R$ arrays and $d_3 d_4 n_{12}$ words for the $Y$ arrays. The intermediate random file contains $\bar{d}_1 \bar{d}_2 d_3 d_4/(n_{12} n_{34})$ blocks of size $n_{12} n_{34}$, which is written and read only once. Maximum efficiency usually requires making the product $n_{12} n_{34}$ as large as possible.

Because this integral transformation step involves $d^5$ operations to transform $d^4$ integrals, it has gained a reputation as a bottleneck in calculations. Actually, however, until $d$ is about 60 the formation of $d^4$ integrals (over contracted gaussian orbitals) takes longer than the transformation. For larger values of $d$ it is likely that a CI matrix of large dimension will be formed using these integrals (or a third or higher order perturbation calculation will be done). Usually these uses of the integrals are more time consuming than their production, so the transformation is seldom the limiting step.

Eigenvalue algorithms. Matrix eigenvalue problems arise in quantum chemistry at both the SCF and CI level. The Roothaan SCF method requires solving a non-orthogonal eigenvalue problem of the dimension of the basis set on each iteration for many of the eigenvalues and eigenvectors. The CI method usually requires finding the lowest few eigenvalues of a large matrix in an orthonormal basis of configurations.

Several algorithms exist which are suitable for finding all of the eigenvalues of any matrix of dimension $d$ which can be kept in central memory. The Jacobi plane rotation method is by far the simplest to program and is reasonably efficient (34). As it is an iterative method the running time cannot be rigorously defined, but times proportional to $d^3$ are expected. Other methods usually begin with a non-iterative transformation to tridiagonal form followed by calculation of the eigenvalues and eigenvectors and a back transformation to the original problem (34,35). The time required for the transformations is proportional to $d^3$ while the time required to solve the tridiagonal problem is only proportional to $d^2$. The Jacobi method is generally slower than these other methods unless the matrix is nearly diagonal.

In SCF calculations one is faced with the non-orthogonal eigenvalue equation
$$F C = S C \Lambda$$

where $\Lambda$ is the diagonal matrix of eigenvalues and $C$ is a matrix of eigenvectors. If an orthogonalizing transformation $W$ is known such that $W^{T} S W = 1$, then

$$W^{T} F W\, W^{-1} C = W^{T} S W\, W^{-1} C\, \Lambda$$

or

$$F' C' = C' \Lambda$$

where

$$F' = W^{T} F W \quad \text{and} \quad C' = W^{-1} C .$$
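As an illustration of such an orthogonalizing transformation, the sketch below builds $W$ by Schmidt orthogonalization in the metric $S$ and forms $W^{T} F W$. It assumes a real, symmetric, positive definite $S$ held as a list of rows. The helper names `schmidt` and `congruence`, and the choice to return $W$ as a list of columns, are made here for illustration.

```python
import math


def schmidt(S):
    """Return W as a list of columns with W^T S W = 1 (Schmidt orthogonalization)."""
    d = len(S)

    def dot(u, v):
        # inner product in the metric S: u^T S v
        return sum(u[a] * S[a][b] * v[b] for a in range(d) for b in range(d))

    cols = []
    for m in range(d):
        v = [1.0 if a == m else 0.0 for a in range(d)]   # start from basis function m
        for w in cols:
            c = dot(w, v)                                # overlap with an earlier vector
            v = [va - c * wa for va, wa in zip(v, w)]
        norm = math.sqrt(dot(v, v))
        cols.append([va / norm for va in v])
    return cols


def congruence(W, F):
    """Return W^T F W, with W given as a list of columns."""
    d = len(F)
    return [[sum(W[i][a] * F[a][b] * W[j][b] for a in range(d) for b in range(d))
             for j in range(len(W))] for i in range(len(W))]
```

Calling `congruence(schmidt(S), S)` should reproduce the unit matrix to rounding error, and `congruence(schmidt(S), F)` gives the matrix $F'$ of the orthogonal eigenvalue problem.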
Usually on the first iteration of an SCF calculation $W$ is computed by the Schmidt orthogonalization method, but thereafter $W$ is chosen to be the $C$ matrix from the previous iteration. This produces an $F'$ matrix which is nearly diagonal, so the Jacobi method becomes quite efficient after the first iteration. Further, in the Jacobi method, $F'$ is diagonalized by an iterative sequence of simple plane-rotation transformations

$$F'_{(n+1)} = X_{(n)}^{T}\, F'_{(n)}\, X_{(n)} .$$

The final eigenvectors of $F$ can thus be generated as

$$C = ((W X_{(1)}) X_{(2)} \cdots X_{(n)}),$$

which avoids the multiplication of $W$ by $C'$.

A disadvantage of the Jacobi method is that the error in the eigenvector is usually proportional to the square root of the error in the eigenvalues. Thus, in 8 digit arithmetic, only 4 figures can be obtained in the eigenvectors. The inverse iteration method of Wilkinson (34) is a method which gives full accuracy in the vectors. This method is based on computing the eigenvector as

$$(\lambda 1 - F'')\, C = X$$

where $\lambda$ is the eigenvalue and $X$ is a guess to the eigenvector. Because this method requires solving a different set of linear equations for each eigenvector, it is only feasible if $F''$ has an easily inverted form (solving linear equations is a $d^3$ process unless the coefficient matrix has some simplifying feature). If $F''$ is tridiagonal, then the time for each vector is proportional to $d$, so the time for $d$ vectors is proportional to $d^2$.

In CI calculations it is necessary to find a few solutions to the matrix eigenvalue problem

$$H C = \lambda C$$

where $H$ is of dimension from 10 to $10^5$. For smaller dimensions it is most efficient to use the standard tridiagonalization routines. For matrices which are too large to fit into high-speed core, special methods have been developed whose time per eigenvalue is proportional only to the number of non-zero matrix elements ($d^2$ at most). These methods should be useful in other areas of chemistry as well.

The first development in this area was the Nesbet method (36) for finding the lowest (or highest) eigenvalue. This method was reorganized into a better algorithm by Shavitt (37) and then extended by Shavitt et al. (38) to find a few non-degenerate eigenvalues. Recently Davidson (39) has combined the fundamental ideas from Nesbet, Lanczos and inverse iteration schemes to form a method which works for the first few eigenvalues even if they are degenerate. His method, however,
involves a little more input-output than the Nesbet or Shavitt methods.

The basic concept of the Nesbet-Shavitt method is based on iterative sequential optimization of the eigenvector elements. If the quantity $\rho(C) = C^{T} H C / C^{T} C$ is known for some $C^{0}$ and $\rho^{0} = \rho(C^{0})$ is below all the diagonal elements of $H$, then sequential minimization of $\rho(C)$ with respect to each element $C_i$ [i.e. solving $(\partial\rho/\partial C_i) = 0$ and then stepping $C_i^{0}$ to the new $C_i$ before going to the next value of $i$] gives

$$\delta C_i = C_i - C_i^{0} = -\,[(H - \rho 1)\, C^{0}]_i \,/\, (H_{ii} - \rho)$$

where

$$\rho = \rho(C^{0} + e_i\, \delta C_i)$$

while for any value of $\delta C_i$

$$\rho(C^{0} + e_i\, \delta C_i) - \rho(C^{0}) = \frac{2\,\delta C_i\, [(H - \rho^{0} 1)\, C^{0}]_i + (\delta C_i)^2\, [H_{ii} - \rho^{0}]}{C^{0T} C^{0} + (\delta C_i)^2 + 2\, C_i^{0}\, \delta C_i} .$$

Nesbet approximated the optimum $\delta C_i$ by

$$\delta C_i = -\,q_i / (H_{ii} - \rho^{0}) \quad \text{where} \quad q_i = [(H - \rho^{0} 1)\, C^{0}]_i$$

while Shavitt found $\delta C_i$ from the slightly more exact formula

$$\delta C_i = -2 q_i \Big/ \Big\{ H_{ii} - \rho^{0} + \sqrt{(H_{ii} - \rho^{0})^2 - 4 q_i\, [-q_i + C_i^{0} (H_{ii} - \rho^{0})] / C^{0T} C^{0}} \Big\} .$$

Both of these formulas can be shown to give monotonic convergence for $\rho$. More importantly, Shavitt showed how the hermitian property of $H$ could be used to write $HC$ as

$$(HC)_i = \sum_{j \leq i} H_{ij}\, C_j + \sum_{j > i} H_{ji}\, C_j$$

so that $H_{ij}$ and $H_{ji}$ did not both need to be stored and read from external store. Shavitt et al. further
modified the Nesbet-Shavitt scheme to do excited states by introducing root-shifting and over-relaxation to speed convergence. Their method, however, often fails to converge for nearly degenerate eigenvalues.

Davidson introduced a different method for higher eigenvalues which also avoids the need to have the elements of $H$ stored in any particular order. In this method the $k$th eigenvector of $H$ for the $n$th iteration is expanded in a sequence of orthonormal vectors $b_i$, $i = 1, \dots, n$, with coefficients found as the $k$th eigenvector of the small matrix with elements $b_i^{T} H b_j$. Convergence can be obtained for a reasonably small value of $n$ if the expansion vectors $b_i$ are chosen appropriately. Davidson defined

$$q^{(n)} = [H - \rho(C^{(n)})\, 1]\, C^{(n)}, \qquad \xi_i^{(n)} = q_i^{(n)} / [\rho(C^{(n)}) - H_{ii}]$$

and chose $b_{n+1}$ as the normalized residual when $\xi^{(n)}$ was orthogonalized to the preceding $b_1, \dots, b_n$. This choice for $b_{n+1}$ is similar to the Nesbet choice (and also to first order perturbation theory and the inverse iteration method). By the excited state variation theorem, the $k$th eigenvalue of the small matrix, as it is sequentially bordered, will decrease monotonically to the $k$th eigenvalue of $H$. Butscher and Kammer (40) have shown how a slight modification of this scheme, which tracks on certain large elements of $C$ rather than the index $k$, can find a $C$ with a certain desired pattern of coefficients without prior knowledge of the value of $k$ and without finding any other eigenvectors.
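The Nesbet update, which the Davidson correction vector resembles, can be sketched in a few lines of Python. This is a minimal dense-matrix illustration, not the published algorithm: $H$ is assumed real symmetric and small enough to hold as a list of rows, the starting vector must give a Rayleigh quotient below every diagonal element of $H$, and the function name, sweep limit, and tolerance are choices made here.

```python
def nesbet_lowest(H, c0, sweeps=500, tol=1e-12):
    """Sequential Rayleigh-quotient relaxation for the lowest eigenvalue of H."""
    d = len(H)
    C = list(c0)
    norm = sum(x * x for x in C)                           # C^T C
    HC = [sum(H[i][j] * C[j] for j in range(d)) for i in range(d)]
    rho = sum(C[i] * HC[i] for i in range(d)) / norm       # Rayleigh quotient
    for _ in range(sweeps):
        biggest = 0.0
        for i in range(d):
            q = HC[i] - rho * C[i]                         # q_i = [(H - rho 1) C]_i
            dC = -q / (H[i][i] - rho)                      # Nesbet's approximate step
            new_norm = norm + 2.0 * C[i] * dC + dC * dC
            # change in rho for this step, using the old rho on the right-hand side
            rho += (2.0 * dC * q + dC * dC * (H[i][i] - rho)) / new_norm
            norm = new_norm
            C[i] += dC
            for k in range(d):                             # keep HC equal to H C
                HC[k] += H[k][i] * dC
            biggest = max(biggest, abs(dC))
        if biggest < tol:
            break
    return rho, C
```

Only column `i` of `H` is touched when element `i` is updated, which is why the method suits matrices that must be read sequentially from external storage.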
Literature Cited

1. Hylleraas, E.A., Z. Physik. (1930) 65, 209.
2. James, H.M. and Coolidge, A.S., J. Chem. Phys. (1933) 1, 825.
3. Hartree, D.R., Proc. Cambridge Phil. Soc. (1928) 24, 89.
4. Thomas, L.H., Proc. Cambridge Phil. Soc. (1927) 23, 542.
5. Slater, J.C., Phys. Rev. (1931) 38, 1109.
6. Pauling, L., J. Am. Chem. Soc. (1931) 53, 1367.
7. Mulliken, R.S., Phys. Rev. (1928) 32, 186.
8. Walsh, A.D., J. Chem. Soc. (1953) 2260.
9. Pauling, L. and Wheland, G.W., J. Chem. Phys. (1933) 1, 362.
10. Pariser, R. and Parr, R.G., J. Chem. Phys. (1953) 21, 466.
11. See for example Karo, A.M. and Allen, L.C., J. Chem. Phys. (1959) 31, 968.
12. Johnson, K.H., Adv. Quantum Chem. (1973) 7, 143.
13. See for example Kouri, D.J., "Energy Structure and Reactivity", Smith, D.W. and McRae, W.B., Eds., John Wiley & Sons, 1973.
14. Boys, S.F., Proc. Roy. Soc. (1950) A200, 542.
15. Hehre, W.J., Stewart, R.F. and Pople, J.A., J. Chem. Phys. (1969) 51, 2657.
16. Dunning, T.H., J. Chem. Phys. (1970) 53, 2823.
17. Shavitt, I., in "Methods in Computational Physics", Vol. 2, Alder, B., Fernbach, S. and Rotenberg, M., Eds., Academic Press, 1963.
18. Huzinaga, S., Supp. Prog. Theoretical Phys. (1967) 52.
19. Shipman, L.L. and Christoffersen, R.E., Comp. Phys. Comm. (1971) 2, 201.
20. Elbert, S.T. and Davidson, E.R., J. Comput. Phys. (1974) 16, 391.
21. Whitten, J.L., J. Chem. Phys. (1963) 39, 349.
22. Rüdenberg, K., J. Chem. Phys. (1951) 19, 1459.
23. Corbato, F.J., J. Chem. Phys. (1956) 24, 452.
24. Roothaan, C.C.J., Rev. Mod. Phys. (1951) 23, 161.
25. Roothaan, C.C.J. and Bagus, P.S., in "Methods in Computational Physics", Vol. 2, Alder, B., Fernbach, S. and Rotenberg, M., Eds., Academic Press, 1963.
26. Guest, M.F. and Saunders, V.R., Mol. Phys. (1974) 28, 219.
27. Hsu, H., Davidson, E.R. and Pitzer, R.M., J. Chem. Phys. (1976) 65, 609.
28. For a thorough review see Shavitt, I., in "Modern Theoretical Chemistry", Vol. 2, Schaefer, H.F. III, Ed., Plenum Press, New York, 1976.
29. Møller, Chr. and Plesset, M.S., Phys. Rev. (1934) 44, 618.
30. Elbert, S.T. and Davidson, E.R., Int. J. Quant. Chem. (1974) 8, 857.
31. Davidson, E.R., "Reduced Density Matrices in Quantum Chemistry", Academic Press, New York, 1976.
32. Davidson, E.R., Int. J. Quant. Chem. (1974) 8, 83.
33. Elbert, S.T., Ab initio Calculations in Urea, Ph.D. thesis, University of Washington, 1973.
34. Wilkinson, J.H., "The Algebraic Eigenvalue Problem", Clarendon Press, Oxford, 1965.
35. Givens, J.W., J. Assoc. Comp. Mach. (1957) 4, 298.
36. Nesbet, R.K., J. Chem. Phys. (1965) 43, 311.
37. Shavitt, I., J. Comput. Phys. (1970) 6, 124.
38. Shavitt, I., Bender, C.F., Pipano, A. and Hosteny, R.P., J. Comput. Phys. (1973) 11, 90.
39. Davidson, E.R., J. Comput. Phys. (1975) 17, 87.
40. Butscher, W. and Kammer, W.E., J. Comput. Phys. (1976) 20, 313.
3
Rational Selection of Algorithms for Molecular Scattering Calculations

ROY G. GORDON
Harvard University, Cambridge, MA 02138
Scattering theory is the link between intermolecular forces and the various experiments with molecular beams, gases, etc., which depend on collisions between molecules. This link is used in both directions: In the theoretical approach the intermolecular forces are used to predict the outcome of experiments. In the empirical approach, experimental results are inverted or analyzed to obtain information about the intermolecular potential.

For most molecular scattering phenomena, it is usually assumed that nonrelativistic quantum mechanics provides an accurate description. Therefore, one might expect the field of molecular collision phenomena to be nicely unified by the application of nonrelativistic quantum-mechanical scattering theory. Instead, one finds that a bewildering variety of methods, approximations, techniques, formulations and reformulations are used to treat molecular collisions. One might be tempted to blame this multitude of approaches on the conceit of the many theoreticians who have worked in this area, each developing his own point of view. In fact, this variety is more nearly due to the following two circumstances:

1. Exact quantum mechanical scattering calculations are not yet feasible for all types of molecular collisions. Therefore some types of approximations are necessary to treat the quantum mechanically intractable cases.
2. The very richness and variety of molecular scattering processes require that a number of different approximation methods be used in different situations.

We believe that suitable methods have in fact been developed to treat successfully almost all types of molecular collisions. The question thus arises: How do we select the most appropriate method for a given problem? In Section II we discuss some criteria for choosing between methods. In Section III we propose an explicit algorithm for selecting the best available method for a given collision process, and for a given set of experiments measuring that process. Then we apply this algorithm to a number of examples, mainly from inelastic scattering. It is hoped that these examples will illustrate the way in which one
should choose between methods, and the kind of information such a choice requires. In addition, the examples described in Section III are all chosen to represent real cases for which calculations have been completed, or are in progress. Thus, they provide a guide to some recent applications of each of the methods discussed, and the reader himself can evaluate the state of the art in applications of each method.

Criteria for Choosing an Appropriate Scattering Theory

In order to make a rational selection of a scattering theory to apply to a specific problem, we must formulate criteria upon which this choice is to be based. It seems to us that there are three main considerations:

Feasibility. It is necessary that the method be applicable, in a practical sense, to the problem of interest. Difficulties may occur at various stages: analytic difficulties (e.g. in evaluating matrix elements, or in transforming coordinate systems); exceeding memory size or running time of computers; difficulties in averaging and analysis of results into a form to compare with experiments.

Accuracy. The results must be sufficiently accurate to interpret the experiments of interest. In a complete quantum-mechanical calculation, this accuracy can be verified by convergence tests within the calculation. In classical, or other approximate methods, accuracy and reliability generally must be judged by experience with test comparisons with complete quantum-mechanical calculations. The numerical stability of the method must also be considered.

Ease of Calculation. When more than one method meets the above criteria of feasibility and accuracy, one has the luxury of choosing the easiest of the possible methods. Some considerations in the ease of calculation might include the following: If the evaluation of the interaction potential is difficult (as it is likely to be in any realistic case), one would prefer the method which requires the smallest number of values of the potential. Other considerations might be the complexity and cost of the computer calculations, and the availability of well-documented and reliable computer programs.

Next we must discuss the specific methods of calculation which we shall recommend, in the light of the three criteria discussed above.

Quantum Scattering ("close coupling") (1). The feasibility of a full quantum scattering calculation depends mostly upon the
number ($N_c$) of internal states which are coupled together by the interaction potential during the strongest part of the collision. The most efficient quantum scattering method currently available is based on piecewise analytic solutions to model potentials which approximate the true potential to any prescribed degree of accuracy (2). Piecewise linear model potentials usually provide sufficient accuracy, along with an accurate and efficient algorithm for the calculations (2). More accurate model potentials can now be based on piecewise quadratic approximations, for which an effective solution algorithm has now been devised (3). While one can program this method to work with whatever size computer is available (using disk storage if necessary), the number of disk accesses becomes rather large unless the computer memory is large enough to store at least eight $N_c$ by $N_c$ matrices ($8 N_c^2$ numbers). Up to about $100 N_c^3$ multiplications and additions are required to construct a single scattering matrix. These storage and timing requirements currently limit feasible calculations to $N_c$ of about 100 or less. Thus a number of approximations are being explored which may reduce the number $N_c$. These include the use of effective Hamiltonians (4-9) and $j_z$-conserving approximations (10-12). Very promising results are being obtained, and these approximations should allow quantum scattering methods to be used for a much wider range of molecules.
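The storage and operation counts quoted above are easy to tabulate. The short Python sketch below simply evaluates $8 N_c^2$ and $100 N_c^3$. The function name and the 8-byte word size are assumptions made here for illustration and are not taken from the text.

```python
def close_coupling_cost(n_c, bytes_per_number=8):
    """Return (numbers stored, bytes, arithmetic operations) for N_c coupled states.

    Uses the estimates from the text: eight N_c x N_c matrices in memory and
    up to about 100 * N_c**3 multiplications and additions per scattering matrix.
    The default of 8 bytes per number is an assumption.
    """
    storage_numbers = 8 * n_c ** 2
    operations = 100 * n_c ** 3
    return storage_numbers, storage_numbers * bytes_per_number, operations
```

For $N_c = 100$ this gives 80000 stored numbers and $10^8$ operations per scattering matrix. Because the operation count grows as $N_c^3$, doubling $N_c$ costs a factor of eight in time.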
The accuracy of the quantum scattering results is limited mainly by the number of internal states included (close-coupling approximation). Therefore one must check that the predictions of interest converge as one increases the number of internal states. The accuracy of the radial integration can be set at any predetermined value. Further work (13) has simplified the perturbation formulas for setting the accuracy of the radial integration. The method was constructed to be numerically stable, and in practice not more than two digits are lost in roundoff error, even in calculations involving millions of arithmetic operations.

As for ease of calculation, only a small number (say 30) of radial integration points are required, so that not too many evaluations of the potential are necessary. A complete computer program for quantum-mechanical elastic and inelastic scattering is available (14).

The quantum theory of reactive scattering is not as highly developed as for inelastic scattering. No generally applicable algorithm has yet been perfected, particularly for three-dimensional reactions. However, many promising approaches are being explored.

Distorted Wave Born Approximation. Quantum scattering calculations are sometimes made using the distorted wave Born approximation (15). Such calculations have the advantage of almost always being feasible numerically. For simple cases, one can also obtain some results analytically (16). However, the accuracy of the results is generally poor for most molecular collisions. A
necessary condition for the results to be accurate is that all the calculated transition probabilities be small compared to unity. However, this is not a sufficient condition, since small transition probabilities can result from fortuitous cancellation of large negative and positive contributions to the perturbation integrals. One can test for this possibility by checking whether the sum of all the perturbation integrals remains small as we build them up by adding on contributions from the various radial intervals. This provides both a necessary and sufficient condition for the validity of perturbation theory.

Classical Mechanics. The description of scattering by classical mechanics has the important advantage of almost always being feasible to carry out. Only three circumstances occasionally make it difficult to obtain results with classical scattering theory:

1) There may be points at which the coordinates chosen for integration become singular or undefined (17). If a trajectory approaches one of these points, the numerical integration may break down. Such difficulties may be avoided by changing coordinate systems.
2) If some coordinates change much more rapidly than others, the equations become difficult to integrate numerically. These difficulties may be reduced by using action-angle coordinates for the rapidly varying coordinates (18), and by using a very stable and accurate integration technique, such as Runge-Kutta.
3) Some trajectories in both inelastic (19) and reactive (20) collisions are long and complicated, corresponding to resonances or long-lived collision complexes. Unless one really needs to know the details of such collisions, it is probably best to use a statistical theory to describe the distribution of results for these collisions.

Semiclassical Methods. The accuracy of classical calculations is usually adequate when the experiments of interest average over at least several quantum states. If, however, no classical trajectories connect the initial and final states of motion, the classical prediction is a vanishing cross section or rate constant for that process. The correct quantum-mechanical prediction may, however, be a small but non-zero rate for such a "classically forbidden" process. "Tunneling" through a potential barrier is a simple example. The connection formulas in the WKB method may be viewed as providing a complex-valued trajectory which does link the "classically forbidden" states. In the WKB treatment, the probability for passing through this complex trajectory is related to the exponential of the imaginary part of the classical action function accumulated along the complex path. Recently, this treatment has been generalized to inelastic and reactive scattering (21-24). The main difficulty at present in applying this method is finding the actual complex trajectories in a numerically stable way.
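For the simple one-dimensional tunneling example, the WKB estimate can be sketched numerically. The Python fragment below assumes the standard textbook form $P \approx \exp(-2\theta)$ with $\theta = \hbar^{-1}\int \sqrt{2m\,(V(x)-E)}\,dx$ taken over the classically forbidden region. The function name, the midpoint rule, and the units with $\hbar = m = 1$ are choices made here, not part of the original discussion.

```python
import math


def wkb_transmission(V, E, x_left, x_right, mass=1.0, hbar=1.0, n=20000):
    """WKB tunneling estimate exp(-2*theta) for a one-dimensional barrier V(x).

    theta is the imaginary part of the action (in units of hbar) accumulated
    where V(x) > E; it is integrated by the midpoint rule on [x_left, x_right].
    """
    h = (x_right - x_left) / n
    theta = 0.0
    for k in range(n):
        x = x_left + (k + 0.5) * h
        gap = V(x) - E
        if gap > 0.0:                        # only the forbidden region contributes
            theta += math.sqrt(2.0 * mass * gap) * h
    return math.exp(-2.0 * theta / hbar)


def parabolic_barrier(x, v0=1.0, k=1.0):
    """Inverted parabola V(x) = v0 - k x^2 / 2, used as a test potential."""
    return v0 - 0.5 * k * x * x
```

For the inverted parabola the integral can be done analytically, giving $\theta = \pi (V_0 - E)\sqrt{m/k}/\hbar$, so the numerical result can be checked against $\exp[-2\pi (V_0 - E)\sqrt{m/k}/\hbar]$.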
Several approaches have been suggested, and this is an active field of current research. One should note
that the method appears also to r e q u i r e that the i n t e r a c t i o n p o t e n t i a l be an a n a l y t i c f u n c t i o n o f a l l i t s c o o r d i n a t e s , so that i t , too, can be a n a l y t i c a l l y continued. Whether a c o n t i n u a t i o n method can be a p p l i e d to a p o t e n t i a l defined by a t a b l e o f numeri c a l values and some i n t e r p o l a t i o n formulae, i s not c l e a r at present. Another p r a c t i c a l problem with the s e m i c l a s s i c a l method, i s the numerical search for t r a j e c t o r i e s with s p e c i f i c (quantized) values o f the i n i t i a l and f i n a l momenta (quantum numbers). For molecules with s e v e r a l i n t e r n a l degrees o f freedom, t h i s may be a d i f f i c u l t t a s k . Furthermore, i f there are more than several t r a j e c t o r i e s with the same i n i t i a l and f i n a l quantum numbers (as i s t y p i c a l l y the case when the t r a j e c t o r i e s are complicated), then the s e m i c l a s s i c a l r e s u l t s may not be very accurate. When c l a s s i c a l mechanics i s a p p l i e d to experiments i n v o l v i n g only one or two quantum s t a t e s , the r e s u l t s are g e n e r a l l y l e s s accurate than f o r the cases i n v o l v i n g averages over many quantum states. However, even simple correspondence p r i n c i p l e arugments, a s s i g n i n g c l a s s i c a l r e s u l t s to the quantum s t a t e o f nearest angular momentum, p r e d i c t l i n e - b r o a d e n i n g cross s e c t i o n s to an accuracy comparable to the experimental u n c e r t a i n t y (19,25-27). Moreover, by i n c l u d i n g i n t e r f e r e n c e e f f e c t s between d i f f e r e n t t r a j e c t o r i e s (28-32), one can make f a i r l y accurate p r e d i c t i o n s for e l a s t i c (28) v i b r a t i o n a l l y (33) and r o t a t i o n a l l y (34) i n e l a s t i c , and r e a c t i v e (35) s c a t t e r i n g . 
This i s a very u s e f u l approach, which w i l l c e r t a i n l y be used more i n future c a l c u l a t i o n s , to improve the accuracy o f c l a s s i c a l p r e d i c t i o n s . The s e m i c l a s s i c a l approach has been reviewed r e c e n t l y by Connor (36). C l a s s i c a l Path. Another approach to s c a t t e r i n g c a l c u l a t i o n s uses a quantum-mechanical d e s c r i p t i o n o f the i n t e r n a l s t a t e s , but c l a s s i c a l mechanics for the t r a n s l a t i o n a l motion. This " c l a s s i c a l path" method has been popular i n line-shape c a l c u l a t i o n s (37,38). It i s almost always f e a s i b l e to carry out such c a l c u l a t i o n s i n the p e r t u r b a t i o n approximation for the i n t e r n a l s t a t e s (37). Only r e c e n t l y have p r a c t i c a l methods been developed to perform nonp e r t u r b a t i v e c a l c u l a t i o n s i n t h i s approach (39). To get accurate r e s u l t s from t h i s approach, i t i s necessary that the c o l l i s i o n a l changes i n the i n t e r n a l energy be small compared to the t r a n s l a t i o n a l energy. Then one can a c c u r a t e l y assume a common t r a n s l a t i o n path for a l l coupled i n t e r n a l s t a t e s . In the usual a p p l i c a t i o n s o f t h i s method, one does not i n c l u d e i n t e r f e r e n c e e f f e c t s between d i f f e r e n t c l a s s i c a l paths, so that t r a n s l a t i o n a l quantum e f f e c t s , i n c l u d i n g t o t a l e l a s t i c cross s e c t i o n s , are not p r e d i c t e d . I f the p e r t u r b a t i o n approximation i s a l s o used, accuracy can be guaranteed only when the sum o f the t r a n s i t i o n p r o b a b i l i t i e s remains small throughout the c o l l i s i o n . 
These classical path calculations are relatively easy to carry out, and analytic results are available in the straight-line path, perturbation limit (40). Thus, when the approximations are valid, this classical path approach should be used.
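The perturbation criterion above can be illustrated with a small numerical experiment. The Python sketch below evaluates a first-order transition probability along a straight-line path and compares it with the closed form for the same integral. It is only a sketch under stated assumptions: the Gaussian coupling, the two-state model, the units with hbar = 1, the function names, and the 0.1 warning threshold are all hypothetical choices made for illustration, not results taken from ref. (40).

```python
import cmath
import math

HBAR = 1.0  # assumption: units in which hbar = 1


def coupling(t, b, v, v0, a):
    """Hypothetical coupling V0*exp(-R^2/a^2) on the straight line R^2 = b^2 + (v*t)^2."""
    return v0 * math.exp(-(b * b + (v * t) ** 2) / (a * a))


def first_order_probability(b, v, v0, a, omega, n_steps=4001):
    """|(1/hbar) * integral of V(t) exp(i*omega*t) dt|^2 by the trapezoidal rule."""
    t_max = 8.0 * a / v                 # the model coupling is negligible beyond this time
    dt = 2.0 * t_max / (n_steps - 1)
    amplitude = 0j
    for k in range(n_steps):
        t = -t_max + k * dt
        weight = 0.5 if k in (0, n_steps - 1) else 1.0
        amplitude += weight * coupling(t, b, v, v0, a) * cmath.exp(1j * omega * t) * dt
    return abs(amplitude / HBAR) ** 2


def analytic_probability(b, v, v0, a, omega):
    """Closed form of the same integral for the Gaussian model coupling."""
    tau = a / v                         # duration of the collision
    strength = v0 * math.exp(-(b / a) ** 2) * tau * math.sqrt(math.pi) / HBAR
    return strength ** 2 * math.exp(-((omega * tau) ** 2) / 2.0)


if __name__ == "__main__":
    # Scan the impact parameter b for one fixed (hypothetical) coupling strength.
    for b in (0.0, 1.0, 2.0):
        p = first_order_probability(b, v=1.0, v0=2.0, a=1.0, omega=0.5)
        verdict = "perturbation theory suspect" if p > 0.1 else "small"
        print(f"b = {b}: P = {p:.4g} ({verdict})")
```

For this model the first-order "probability" exceeds unity at small impact parameter, which is a clear sign that the perturbation approximation has broken down there, while at large impact parameter it stays well below one.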
An Algorithm for Choosing an Appropriate Scattering Theory

Using the criteria discussed above, we wish to select the easiest method of calculation which is both feasible to apply to the molecules of interest, and whose results are sufficiently accurate to describe the relevant experimental results. We have found it convenient to organize this selection process into a flow chart, which is given in Fig. 1. Starting at the top, one makes a sequence of decisions based upon the criteria for feasibility and accuracy. Decisions about the relative ease of different methods are not made explicitly; they are implicit in the organization of the flow chart. When one's path in the flow chart reaches a box with no lines going out from it, and double underlines at its bottom, one has arrived at the most suitable method.

In some cases, one's decision at some point may be conditional on a variable in the problem. For example, transition probabilities may be small compared to unity for large orbital angular momenta, but not for small ones. In such cases one should follow both branches of the decision, and arrive at two different methods, one for each range of the variable. In a few such cases, both branches may later rejoin, and only one method is recommended after all. In more difficult cases, as many as three different methods have been found to be necessary for different ranges of the variables. Examples of all these cases have been found.

We first follow the flow chart for the simple case of elastic scattering of structureless atoms.
Since the number of internal states, $N_c$, is one, quantum scattering calculations are feasible and recommended, for even the smallest modern computer. The Numerov method has often been used for such calculations (41), but the recent method based on analytic approximations by Airy functions (2) obtains the same results with many fewer evaluations of the potential function. The WKB approximation also requires a relatively small number of function evaluations, but its accuracy is limited, whereas the piecewise analytic method (2) can obtain results to any preset, desired accuracy.

Next we consider rotationally inelastic scattering of H2 with He. At room temperature, the maximum rotational angular momentum state which is significantly populated is $j_{max} = 4$. Thus we estimate $N_c = (j_{max}/2 + 1)^2 = 9$, including all the m-states. The data storage $8N_c^2$ is less than 1000 numbers, only a small addition to the quantum scattering program code (about 100 K-bytes). Assuming a multiply time of 1 μsec, the $100N_c^3$ multiplications take less than 0.1 sec computer time per S matrix. Thus the quantum scattering calculations are quite practical, and have been carried out for more than a dozen different potential surfaces (42). The results are in good agreement with molecular beam results, sound absorption, and line shapes in light scattering and NMR. Because of the wide spacing of the rotational levels, and the relatively weak angle-dependent potential, these results converge very quickly as $j_{max}$ increases, and $j_{max} = 4$ is adequate for all the experiments at temperatures up to 300 K.

Figure 1. Flow chart for choosing an appropriate scattering theory. Boxes are listed in order from START; each question is followed by the box reached on "yes".

START

Let $N_c$ be the maximum number of internal states or basis functions which are coupled during collision. Do about $8N_c^2$ numbers fit into your computer's memory, and can you afford about $100N_c^3$ multiplications on your computer, per S matrix?
    yes: Calculate exact quantum scattering results by the method of piecewise analytic solutions (ref. (2)). Do the results converge as internal states are added?
        yes: Accept these quantum scattering results.

Do all the experiments you are interpreting average over more than about 10 internal states?
    yes: Do real classical trajectories connect the initial and final quantum states of interest?
        yes: Compute your results using these real trajectories, plus a correspondence principle, if necessary (ref. (25-27)).
    Can you find complex classical trajectories which connect the quantum states of interest?
        yes: Compute your results from analytically continued classical mechanics: (complex) trajectories (ref. (21-24)).

Do any of the experiments of interest have angular resolution sufficient to resolve oscillations due to quantum interference or to observe the total elastic cross section?
    yes: Try a calculation using the Distorted Wave Born Approximation (ref. (15)). Are all the transition probabilities $|T_{ij}|^2 \ll 1$ at all radii during collision?
        yes: Accept the results of the Distorted Wave Born Approximation.

Are the changes in internal energy small compared to the translational energy?
    yes: Use a fixed classical path, independent of internal states, and perturbation theory on the internal states (ref. (37)). Are all the transition probabilities $|T_{ij}|^2 \ll 1$ at all times during the collision?
        yes: Accept these results.

Let $N_i$ be the number of initial states of interest. Can you fit $N_c(N_c + 2N_i)$ numbers in your computer memory, and can you afford about $50N_c^2(N_c + N_i)$ multiplications on your computer, per S matrix?
    yes: Use a fixed classical path, independent of internal states, with an exact, nonperturbative treatment of internal states (ref. (39)). Do these results converge as internal states are added?
        yes: Accept the results of this classical path, quantum internal states calculation.

Compute "classical" S-matrices (ref. (31)) with interferences between different trajectories.

For collisions of H2 with atoms at higher energies, both vibrational and rotational excitation occur. At 1 eV, about 50 channels are open. For a complete quantum scattering calculation, we estimate data storage at $8N_c^2 = 20{,}000$ single precision words, and computer time of 12 sec per S matrix (again assuming a 1 μsec multiply time). Convergence is obtained with the addition of a few closed channels, and such calculations are feasible, and have recently been carried out for H2 + He (43) and H2 + Li+ (44).

For vibrational and rotational relaxation of D2 at 1 eV, about 140 channels are open, so the quantum scattering estimates are about 160,000 numbers in data storage, and about 5 min computing time per S matrix, or 2 sec per initial condition. While such calculations are feasible on a large computer, they might be too expensive. Then, if one is averaging over rotational states to find vibrational transition probabilities, the flow chart suggests classical trajectories. However, the vibrational coupling is so weak that no real trajectories connect different vibrational states, so complex trajectories must be calculated to find the vibrational transition probabilities (45).
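The storage and time estimates quoted for these examples all come from the two rules of thumb in the flow chart: about $8N_c^2$ stored numbers and about $100N_c^3$ multiplications per S matrix. A minimal Python sketch of that arithmetic follows; the function name `quantum_cost` and its default 1 μsec multiply time are illustrative assumptions, not part of any program described in the text.

```python
def quantum_cost(n_channels, multiply_time_s=1.0e-6):
    """Return (stored numbers, seconds per S matrix) for n_channels coupled states.

    Uses the rules of thumb 8*N_c**2 for storage and 100*N_c**3 multiplications.
    """
    storage = 8 * n_channels ** 2
    seconds = 100 * n_channels ** 3 * multiply_time_s
    return storage, seconds


if __name__ == "__main__":
    examples = [("H2 + He, room temperature", 9),
                ("H2 + atom, 1 eV", 50),
                ("D2, 1 eV", 140)]
    for label, n_c in examples:
        storage, seconds = quantum_cost(n_c)
        print(f"{label}: N_c = {n_c}, storage = {storage} numbers, "
              f"time = {seconds:.3g} s per S matrix, "
              f"{seconds / n_c:.3g} s per initial condition")
```

For 9, 50 and 140 channels this gives 648, 20,000 and 156,800 stored numbers, and roughly 0.07 sec, 12.5 sec and 4.6 min per S matrix, in line with the rounded figures given in the text.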
One should note, however, that if one wants to find all the individual rotation-vibration transition probabilities, the quantum calculation, at 2 sec per initial condition, uses less computer time than the complex trajectory calculation, which requires about 2 sec per complex trajectory, and a search of several complex trajectories for each initial condition.

If we consider the collisions of two molecules (rather than atom + molecule, as above), the number of coupled channels is approximately the square of the number of accessible internal states of either molecule separately. Thus for rotational excitation of two hydrogen molecules near room temperature, $N_c = (j_{max}/2 + 1)^4 = 81$ for $j_{max} = 4$, and quantum calculations are feasible. However, for vibration-rotation transitions at 1 eV, 50 internal states for each molecule correspond to $N_c = 2500$ channels, and exact quantum calculations are not feasible. If we want individual transition probabilities for this case, the flow chart brings us to try the distorted wave Born approximation, which is feasible and accurate for this case.

Next we consider some more difficult cases, in which several methods are recommended for different parts of the calculation. For rotational excitation of HCl by Ar at room temperature, the maximum rotational angular momentum quantum number coupled during collision is about 12. The maximum number of coupled j,m states is $N_c = (j_{max} + 1)(j_{max} + 2)/2 = 91$, since HCl is a heterodiatomic molecule, and thus all states of the same total parity are coupled. With 91 channels, the quantum scattering calculations are feasible, but rather expensive. A further complication of the quantum calculations for this case is the fact that many bound states of HCl + Ar exist, which will lead to many resonances in the scattering, and thus to difficulty in energy averaging the cross sections. Thus we explore the alternative methods with the flow chart.

For interpreting infrared line-widths, we average over the 2j + 1 m-states. For an initial j greater than 5 we thus average over enough m states so that the classical method, plus the correspondence principle, is adequate for these cases. For the low-j lines, we observe that in the absence of differential cross section measurements, we do not require a "high resolution" quantum calculation. The rotational energy changes, for the low j states, are small compared to the typical translational energies, so the fixed classical path approximation is valid. For collisions at large impact parameter, the classical path-perturbation theory results are of acceptable accuracy. However, for small impact parameter cases, the perturbation theory fails. To select a method for the remaining cases we note that the maximum number of coupled initial states up to j = 5 is $N_i = (j + 1)(j + 2)/2 = 21$. The storage estimates for a non-perturbative classical path calculation are thus $91(91 + 2 \times 21) \approx 12{,}000$ numbers, and computer time $50(91)^2(91 + 21) \times 10^{-6}$ sec = 46 sec per S matrix. This classical path method is thus feasible for the remaining initial conditions, and has been used (39) to calculate infrared and NMR line shapes for this system.
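The channel counts and cost estimates for HCl + Ar can be checked in a few lines. The sketch below evaluates the heterodiatomic channel count $(j_{max} + 1)(j_{max} + 2)/2$ and the flow-chart estimates $N_c(N_c + 2N_i)$ and $50N_c^2(N_c + N_i)$ for the nonperturbative classical path method. The function names and the default 1 μsec multiply time are assumptions made for this illustration.

```python
def heteronuclear_channels(j_max):
    """Evaluate (j_max + 1)(j_max + 2)/2, the coupled-state count used in the text."""
    return (j_max + 1) * (j_max + 2) // 2


def classical_path_cost(n_c, n_i, multiply_time_s=1.0e-6):
    """Return (stored numbers, seconds per S matrix) for the nonperturbative
    fixed-classical-path method, using N_c(N_c + 2 N_i) and 50 N_c^2 (N_c + N_i)."""
    storage = n_c * (n_c + 2 * n_i)
    seconds = 50 * n_c ** 2 * (n_c + n_i) * multiply_time_s
    return storage, seconds


if __name__ == "__main__":
    n_c = heteronuclear_channels(12)   # all states coupled during collision
    n_i = heteronuclear_channels(5)    # initial states of interest, up to j = 5
    storage, seconds = classical_path_cost(n_c, n_i)
    print(f"N_c = {n_c}, N_i = {n_i}")
    print(f"storage = {storage} numbers, time = {seconds:.1f} s per S matrix")
```

With $j_{max} = 12$ and initial states up to j = 5 this returns 91 and 21 channels, about 12,100 stored numbers, and about 46 sec per S matrix.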
For a heavier system, such as + Ar, a calculation of rotational transitions and microwave or infrared line widths would follow the same course through the flow chart as that followed above in detail for HCl + Ar. However, at the last stage (low j, small b collisions), the number of coupled states would probably be too large for the non-perturbative, fixed classical path calculation to be practical. Then one should calculate "classical S matrices" including interference between trajectories, to cover these remaining collisions.
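The selection procedure traced through these examples can be written down as a short decision function. The Python sketch below is one possible reading of Fig. 1, not a transcription of it: the class and field names, the memory and time budgets, and in particular the rule that any answer other than "yes" falls through to the next question down the chart are assumptions, and the convergence checks in the chart are omitted for brevity.

```python
from dataclasses import dataclass

QUANTUM = "exact quantum scattering (piecewise analytic solutions)"
CLASSICAL = "real classical trajectories plus correspondence principle"
COMPLEX = "complex classical trajectories"
DWBA = "distorted wave Born approximation"
PATH_PERT = "fixed classical path with perturbation theory"
PATH_EXACT = "fixed classical path, nonperturbative internal states"
CLASSICAL_S = "classical S matrices with interference"


@dataclass
class Problem:
    n_c: int                  # coupled internal states
    n_i: int                  # initial states of interest
    memory_numbers: float     # hypothetical budget: numbers that fit in memory
    multiply_budget: float    # hypothetical budget: multiplications per S matrix
    averages_over_many_states: bool = False     # more than about 10 states
    real_trajectories_connect: bool = False
    complex_trajectories_found: bool = False
    needs_interference_resolution: bool = False
    dwba_probabilities_small: bool = False
    small_energy_change: bool = False
    perturbation_probabilities_small: bool = False


def choose_method(p: Problem) -> str:
    """Walk the questions of the flow chart in order (assumed fall-through on 'no')."""
    if 8 * p.n_c ** 2 <= p.memory_numbers and 100 * p.n_c ** 3 <= p.multiply_budget:
        return QUANTUM
    if p.averages_over_many_states:
        if p.real_trajectories_connect:
            return CLASSICAL
        if p.complex_trajectories_found:
            return COMPLEX
    if p.needs_interference_resolution and p.dwba_probabilities_small:
        return DWBA
    if p.small_energy_change:
        if p.perturbation_probabilities_small:
            return PATH_PERT
        storage = p.n_c * (p.n_c + 2 * p.n_i)
        work = 50 * p.n_c ** 2 * (p.n_c + p.n_i)
        if storage <= p.memory_numbers and work <= p.multiply_budget:
            return PATH_EXACT
    return CLASSICAL_S


if __name__ == "__main__":
    # HCl + Ar with a budget chosen so that the full quantum calculation is too costly.
    base = dict(n_c=91, n_i=21, memory_numbers=50_000, multiply_budget=5.0e7)
    print(choose_method(Problem(**base, averages_over_many_states=True,
                                real_trajectories_connect=True)))
    print(choose_method(Problem(**base, small_energy_change=True,
                                perturbation_probabilities_small=True)))
    print(choose_method(Problem(**base, small_energy_change=True)))
```

With these hypothetical budgets the three calls reproduce the three methods recommended in the text for HCl + Ar: classical trajectories for the high-j lines, the perturbative classical path at large impact parameter, and the nonperturbative classical path at small impact parameter.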
Conclusion

The theory of molecular scattering has now been developed to the point that scattering calculations can be made with an accuracy sufficient for comparison with current experiments. Thus any discrepancy between theory and experiment should be traced to an inadequate knowledge of the interaction potentials, or to experimental errors, rather than to approximations in the collision dynamics. This tighter coupling of theory and experiment should permit a much more fruitful utilization of the results of molecular beam scattering.

Abstract

A critical discussion is given of some of the more useful and accurate methods for the calculation of cross sections for various types of molecular collisions. Quantum mechanical, classical and semiclassical methods are considered. Criteria are summarized for the feasibility of various calculations, and for the accuracy of the results. A flow chart is formulated, which uses these criteria to select, for given molecules and types of experiments, the easiest calculational algorithm which yields accurate results. Examples of this selection process are given, drawn mainly from recent calculations of inelastic scattering.

Acknowledgments

This work was supported in part by the National Science Foundation. It is based on a paper delivered at the General Discussion on Molecular Beam Scattering, 16th-18th of April, 1973, with some additional comments and more recent references.

Literature Cited

1. For reviews of recent quantum scattering methods, see "Methods in Computational Physics," B. Alder et al., Ed., Vol. 10, Academic Press, New York, 1971.
2. Gordon, R.G., ibid., Chap. 2, p. 81.
3. Luthey, Z., thesis (Harvard University, 1974).
4. Rabitz, H., J. Chem. Phys. (1972) 57, 1718.
5. Zarur, G. and Rabitz, H., J. Chem. Phys. (1973) 59, 943.
6. ibid. (1974) 60, 2057.
7. Englot, G. and Rabitz, H., Chem. Phys. (1974) 4, 458.
8. Englot, G. and Rabitz, H., Phys. Rev. (1974) A10, 2187.
9. Reviewed by H. Rabitz, "Modern Theoretical Chemistry III," W.H. Miller, Ed. (to appear, 1976).
10. McGuire, P., Chem. Phys. Lett. (1973) 23, 575.
11. Kouri, D.J. and McGuire, P., Chem. Phys. Lett. (1974) 29, 414.
12. McGuire, P. and Kouri, D.J., J. Chem. Phys. (1974) 60, 2488.
13. Rosenthal, A. and Gordon, R.G., J. Chem. Phys. (1976) 64, 1621.
14. Program No. 187, Quantum Chemistry Program Exchange, Chemistry Department, Indiana University, Bloomington, Indiana 47401, U.S.A.
15. See, for example, Rodberg, L.S. and Thaler, R.M., "Introduction to the Quantum Theory of Scattering," Chap. 12, Academic Press, New York, 1967.
16. Starkschall, G. and Gordon, R.G., to be published.
17. Cross, R.J., Jr., and Herschbach, D.R., J. Chem. Phys. (1965) 43, 3530.
18. Cohen, A.O. and Gordon, R.G., to be published.
19. Pearson, R. and Gordon, R.G., to be published.
20. Brumer, P.W. and Karplus, M., to be published.
21. Miller, W.H. and George, T.F., J. Chem. Phys. (1972) 56, 5668.
22. ibid., 5722.
23. Doll, J.D. and Miller, W.H., J. Chem. Phys. (1972) 57, 5019.
24. Marcus, R., Kreek, H.R. and Stine, J.R., Disc. Faraday Soc. (1972).
25. Gordon, R.G., J. Chem. Phys. (1966) 44, 3083.
26. Gordon, R.G. and McGinnis, R.P., ibid. (1971) 55, 4898.
27. Bunker, D.L., ref. 1, Chap. 7.
28. Ford, K.W. and Wheeler, J.A., Ann. Phys. (N.Y.) (1959) 7, 259.
29. Miller, W.H., J. Chem. Phys. (1970) 53, 1949.
30. ibid., 3578.
31. Miller, W.H., Adv. Chem. Phys. (1974) 25, 69.
32. Marcus, R.A., J. Chem. Phys. (1972) 57, 4903, and references therein.
33. Miller, W.H., Chem. Phys. Lett. (1970) 7, 431.
34. Miller, W.H., J. Chem. Phys. (1971) 54, 5386.
35. Rankin, C.C. and Miller, W.H., J. Chem. Phys. (1971) 55, 3150.
36. Connor, J.N.L., Chemical Society Reviews 5, 125.
37. Anderson, P.W., Phys. Rev. (1949) 76, 647.
38. For a recent review see Birnbaum, G., Adv. Chem. Phys. (1967) 12, 487.
39. Neilsen, W. and Gordon, R.G., J. Chem. Phys. (1973) 58, 4131.
40. Cross, R.J. and Gordon, R.G., J. Chem. Phys. (1966) 45, 3571.
41. Cooley, J.W., Math. Computation (1961) 15, 363.
42. Shafer, R. and Gordon, R.G., J. Chem. Phys. (1973) 58, 5422.
43. Eastes, W. and Secrest, D., J. Chem. Phys. (1972) 56, 640.
44. van den Bergh, ., David, R., Fraubel, M., Fremerey, H. and Toennies, J.P., Disc. Faraday Soc. (1972).
45. Miller, W.H., Disc. Faraday Soc. (1972).
4
Molecular Dynamics and Transition State Theory: The Simulation of Infrequent Events
CHARLES H. BENNETT
IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598
Before the advent of the high speed digital computer, the theoretical treatment of atomic motion was limited to systems whose dynamics admitted an approximate separation of the many-body problem into analytically tractable one- or two-body problems. Two approximations were the most useful in making this separation:

1) Stochastic approximations such as 'random walk' or 'molecular chaos', which treat the motion as a succession of simple one- or two-body events, neglecting the correlations between these events implied by the over-all deterministic dynamics. The analytical theory of gases, for example, is based on the molecular chaos assumption, i.e. the neglect of correlations between consecutive collision partners of the same molecule. Another example is the random walk theory of diffusion in solids, which neglects the dynamical correlations between consecutive jumps of a diffusing lattice vacancy or interstitial.

2) The harmonic approximation, which treats atomic vibrations as a superposition of independent normal modes. This has been most successfully applied to solids and free molecules at low temperatures, where the amplitude of oscillation is small enough to remain in the neighborhood of a quadratic minimum of the potential energy.

Transition state theory (1), the traditional way of calculating the frequency of infrequent dynamical events (transitions) involving a bottleneck or saddle point, typically had to call on both these approximations before yielding quantitative predictions. Because of the unavailability of a method for solving the classical many-body problem directly, the harmonic approximation was sometimes stretched, or stochastic behavior assumed too early, in an effort to
predict equilibrium thermodynamic properties, transport coefficients, and transition rates in systems that were too strongly coupled and too anharmonic for the results to be reliable.

In the last two decades this situation has been radically changed by the ability of computers to integrate the classical equations of motion (the classical trajectory or molecular dynamics 'MD' technique, cf. refs. 2, 3, 4, and reviews 5, 6) for systems of up to several thousand particles, thereby making it possible to attack by direct simulation such previously-intractable problems as the equilibrium and transport properties of liquids and hot anharmonic solids, chemical reactions in gases, the structure of small droplets, and conformational rearrangements in large molecules. In addition to providing dynamical information, the molecular dynamics method (as well as the Monte Carlo (MC) method of Metropolis et al., refs. 7, 8, 6) is routinely used to calculate equilibrium thermodynamic properties in many of the same systems (especially liquids), when these cannot be obtained analytically.

The scope of these classical simulation techniques is determined by a number of considerations:

1) They are not applicable to strongly quantum mechanical systems, like liquid or solid H2 or He, in which the thermal de Broglie wavelength $h/\sqrt{2\pi mkT}$ is comparable to the atomic dimensions.

2) They are unnecessary when the harmonic or random walk approximations are valid (e.g. in calculating the thermodynamic properties of cold solids or dilute gases).
3) The potential energy surface (i.e. the potential energy expressed as a function of the atomic positions) on which the classical trajectory moves is almost always semi-empirical and rather imprecisely known, because accurate quantum mechanical calculations of it are impossibly expensive except in the simplest systems. For use in a MD or MC program, the potential energy must be rendered into a form (e.g. a sum of two-body and sometimes three-body forces) that can be evaluated repeatedly at a cost of not more than a few seconds computer time per evaluation.

4) The methods are of course restricted to simulating systems of microscopic size (typically between 3 and 10,000 atoms). This is not a very serious limitation because on the one hand, with existing algorithms, simulation cost increases only a little faster than linearly with the number of atoms; and, on the other hand, a system of 1000 atoms or less is generally large enough to reproduce most macroscopic properties of matter, except for long range fluctuations near critical
points.

5) The most serious practical limitation of molecular dynamics comes from its slowness: for a small (10-20 atom) system each second of computer time suffices to simulate about 1 picosecond of physical time, whereas one is often interested in simulating phenomena taking place on a much longer time scale. This problem is not merely a matter of existing computers being too slow (indeed, 1 to 10 picoseconds per second is about as fast as one can comfortably watch an animated display of molecular motion); rather, it is a manifestation of a common paradox in molecular dynamics: concealment of the desired information by mountains of irrelevant detail.

The bulk of this chapter will expound a synthesis of molecular dynamics (and Monte Carlo) methods with transition state theory that combines the former's freedom from questionable approximations with the latter's ability to predict arbitrarily infrequent events, events that would be prohibitively expensive to simulate directly. However, before beginning this exposition, a few more philosophical remarks will be made on the irony of being able to simulate molecular motion accurately on a picosecond time scale, without thereby being able to understand the consequences of that motion on a 1 second time scale.

To exhibit the irony in an extreme form, consider a system whose simulation is somewhat beyond the range of present molecular dynamics technique: a globular protein (e.g. an enzyme) in its normal aqueous environment. An animated movie of this system could not be run much faster than 10 picoseconds per second (10 psec.
is approximately the lifetime of a hydrogen bond in water) without having the water molecules move too fast for the eye to follow. At this rate, a typical enzyme-catalyzed reaction would take several years to watch, and the spontaneous folding-up of the globular protein from an extended polypeptide chain would take thousands of years. The calculation necessary to make the movie would of course take several orders of magnitude longer on present computers; but even if speed of computation were not a problem, watching such a long movie would be.

It is hard to believe that, in order to see how the enzyme works, or how the protein folds up, one must view the movie in its entirety. It is more plausible that there are only a few interesting parts, during which the system passes through critical bottlenecks in its configuration space; the rest of the time being spent exploring large, equilibrated reservoirs between the bottlenecks. If the trajectory calculation were
repeated many times, starting from slightly different initial conditions, one would expect the trajectory to pass through the same critical bottlenecks in the same order, but the less constrained portions of the trajectory, between bottlenecks, would probably be different each time. An adequate understanding of the relaxation process as a whole could therefore be gained by gathering dynamical information on trajectories in the neighborhood of each critical bottleneck, and supplementing this by a statistical characterization (in terms of a first-order rate constant, or its reciprocal, a mean residence time) of each intervening reservoir.

Before accepting the hypothesis that only a few parts of the movie would be interesting enough to call for detailed dynamical simulation, let us consider the two remaining possibilities for a thousand-year movie, viz. uniformly dull, and uniformly interesting.

The uniformly dull movie would depict a slow, uniformly-progressive relaxation process, like the diffusion of impurities into a homogeneous medium or the fall of sand through an hourglass. Such a relaxation process has no single bottleneck (or, equivalently, has very many small equal bottlenecks), but it is only likely to occur in a system that possesses some obvious structural uniformity (in the cases cited, the uniformity of the medium into which diffusion occurs, or the uniformity of the sand grains), which would account in a natural way for the uniform rate of progress at different degrees of completion.
More precisely and restrictively, the uniform slow progress can usually be measured by one or a few slowly-relaxing, hydrodynamic degrees of freedom, whose equations of motion can be solved independently of the other degrees of freedom. In the hourglass example, the mean height of the sand is such a degree of freedom; its approximate equation of motion can be solved without reference to the detailed trajectory, which passes through a new bottleneck in configuration space every time a grain of sand falls through the bottleneck in real space. A movie of such a hydrodynamic relaxation process has no really exciting parts, but all parts are more or less typical, and an understanding of the process as a whole can be gained by viewing a few parts (say at the beginning, middle, and end), and interpolating between them by the equations of motion for the slow degrees of freedom. The detailed sequence of bottlenecks (e.g. the order in which the sand grains fell) is not reproducible by this procedure, but neither is it important. The connection between molecular dynamics and hydrodynamics in uniform fluids is
of considerable current interest (9), but it is peripheral to the subject of this review, viz. relatively fast but infrequent events, particularly those occurring in spatially nonuniform systems, whose lack of symmetry practically guarantees that a few bottlenecks will be much harder than all the rest.

Undoubtedly there are systems that suffer both from bottlenecks and slow modes, e.g. any sizeable change in the conformation of a protein involves many atoms and is damped by the viscosity of the surrounding water; thus, even in the absence of any activation barrier, it would have a relaxation time several orders of magnitude longer than that of a single water molecule. However, really large disparities in time scale, e.g. 10 orders of magnitude in a system of a few thousand atoms, cannot result from hydrodynamic modes alone, but must be due chiefly to bottlenecks.

The final possibility, a uniformly interesting movie, would have to depict a process with thousands or millions of critical steps occurring in a definite order, each step necessary to understand the next, as in an industrial process, the functioning of a digital computer, or the development of an embryo. Enzymes, having been optimized by natural selection, may be expected to have somewhat complex mechanisms of action, perhaps with several equally important critical steps, but not with thousands of them. There is reason to believe that processes with thousands of reproducible non-trivial steps usually occur only in systems that are held away from thermal equilibrium by an external driving force.
They thus belong to the realm of complex behavior in continuously dissipative open systems, rather than to the realm of relaxation processes in closed systems.

Transition State Theory and Molecular Dynamics

The idea of characterizing infrequent events in terms of a bottleneck or saddle point neighborhood is much older than the digital computer, and indeed is the basis of transition state theory (TST), developed in the thirties (1) and since then applied to a wide variety of relaxation phenomena, from chemical reactions in gases to diffusion in solids. Unfortunately, before the feasibility of large scale Monte Carlo and dynamic calculations, transition state theory could not be developed to the point of yielding quantitative predictions without making certain simplifying assumptions which usually were not theoretically justified, although they often worked well in practice. Three somewhat related assumptions were generally made:

1) that the bottleneck is an approximately quadratic portion of the potential energy surface containing a single saddle point (i.e. a point where the first derivative, $\nabla U$, of the potential energy is zero and where its second derivative matrix, $\nabla\nabla U$, has exactly one negative eigenvalue). For this (harmonic) approximation to be justified, the nearly-quadratic portion of the potential energy surface should extend at least kT above and below the exact saddle point.

2) that the typical trajectory does not reverse its direction while in the saddle point neighborhood (in other words, the transmission coefficient is 100 per cent).

3) that an equilibrium distribution of microstates prevails in the saddle point neighborhood, even when the system as a whole is in a non-equilibrium macrostate, with trajectories approaching the saddle point from one side ('reactant') but not the other ('product').

By marrying molecular dynamics to transition state theory, these questionable assumptions can be dispensed with, and one can simulate a relaxation process involving bottlenecks rigorously, assuming only 1) classical mechanics, and 2) local equilibrium within the reactant and product zones separately. For simplicity we will first treat a situation in which there is only one bottleneck, whose location is known. Later, we will consider processes involving many bottlenecks, and will discuss computer-assisted heuristic methods for finding bottlenecks when their locations are not known a priori.
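The saddle point criterion in assumption 1 (zero first derivative, exactly one negative eigenvalue of the second derivative matrix) can be checked numerically. The following is a minimal sketch, not part of the original paper: the two-dimensional model surface `potential`, the finite-difference step sizes, and the tolerance `gtol` are illustrative assumptions.

```python
import numpy as np

def potential(q):
    """Hypothetical 2-D model surface: double well along x, harmonic along y."""
    x, y = q
    return (x * x - 1.0) ** 2 + y * y

def gradient(U, q, h=1e-5):
    """Central-difference estimate of the first derivative of U at q."""
    q = np.asarray(q, dtype=float)
    g = np.zeros_like(q)
    for i in range(q.size):
        e = np.zeros_like(q)
        e[i] = h
        g[i] = (U(q + e) - U(q - e)) / (2.0 * h)
    return g

def hessian(U, q, h=1e-4):
    """Central-difference estimate of the second derivative matrix of U at q."""
    q = np.asarray(q, dtype=float)
    n = q.size
    H = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n)
        ei[i] = h
        for j in range(n):
            ej = np.zeros(n)
            ej[j] = h
            H[i, j] = (U(q + ei + ej) - U(q + ei - ej)
                       - U(q - ei + ej) + U(q - ei - ej)) / (4.0 * h * h)
    return H

def is_saddle_point(U, q, gtol=1e-6):
    """True if the gradient vanishes and exactly one Hessian eigenvalue is negative."""
    eigenvalues = np.linalg.eigvalsh(hessian(U, q))
    return bool(np.linalg.norm(gradient(U, q)) < gtol
                and np.sum(eigenvalues < 0.0) == 1)

at_barrier = is_saddle_point(potential, [0.0, 0.0])   # barrier top between the wells
at_minimum = is_saddle_point(potential, [1.0, 0.0])   # bottom of one well
```

On this model surface the origin satisfies the criterion and the well bottom does not. The check says nothing about whether the quadratic region extends kT above and below the saddle, which is the harder part of assumption 1.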
The essential trick for doing dynamical simulations of infrequent events, discovered by Keck (10), is to use starting points chosen from an equilibrium distribution in the bottleneck region, and from each of these starting points to generate a trajectory by integrating Newton's equations both forward and backward in time, rather than to use starting points in the reactant region and compute trajectories forward in time, hoping for them to enter the bottleneck. One thus avoids wasting a lot of time calculating trajectories that do not enter. Furthermore, although the trajectories are originally calculated on the basis of an equilibrium distribution in the bottleneck, this distribution can be rigorously corrected, using information provided by the trajectories themselves, to reflect the situation in a bottleneck connecting two reservoirs not at equilibrium with each other.

The system in which the transitions are occurring will be assumed to be a closed system consisting of N = several to several thousand atoms, describable by a classical Hamiltonian

$$ H = \sum_{i=1}^{3N} \frac{p_i^2}{2 m_i} + U(q_1, q_2, \ldots, q_{3N}), \qquad (1) $$

where $q_i$ denotes the i'th atomic cartesian coordinate, $p_i$ its conjugate momentum, and $m_i$ its mass, and where U(q) is the potential energy function discussed earlier. The system will be assumed to have no constants of motion other than the energy: linear momentum, even if conserved, affects the dynamics only in a trivial manner, while angular momentum is not conserved in the presence of periodic boundary conditions (these are ordinarily used in molecular dynamics work on condensed systems to abolish surface effects). It is often convenient to define the Hamiltonian in terms of mass-weighted coordinates, $q_i' = q_i \sqrt{m_i}$, so that the equilibrium velocity distribution becomes isotropic, and the dynamics is simply that of a particle rolling on the potential energy surface: $\ddot{q} = -\nabla U(q)$.
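Keck's procedure of integrating forward and backward in time from a starting point in the bottleneck can be sketched as follows. This is an illustration under stated assumptions rather than the paper's own code: the one-dimensional double well `U`, the velocity Verlet integrator, the time step, and the starting point are all hypothetical choices, and mass-weighted coordinates are assumed so that the equation of motion is $\ddot{q} = -\nabla U(q)$.

```python
import numpy as np

def U(q):
    """Hypothetical double well with minima at q = -1 (reactant) and q = +1 (product)."""
    return (q * q - 1.0) ** 2

def grad_U(q):
    return 4.0 * q * (q * q - 1.0)

def verlet(q, v, dt, n_steps):
    """Velocity Verlet integration of q'' = -grad U(q); a negative dt runs backward in time."""
    qs, vs = [q], [v]
    a = -grad_U(q)
    for _ in range(n_steps):
        q = q + v * dt + 0.5 * a * dt * dt
        a_new = -grad_U(q)
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
        qs.append(q)
        vs.append(v)
    return np.array(qs), np.array(vs)

# Starting point on the barrier top, moving toward the product side.
q0, v0 = 0.0, 0.5
fwd_q, fwd_v = verlet(q0, v0, +0.005, 200)   # future of the starting point
bwd_q, bwd_v = verlet(q0, v0, -0.005, 200)   # past of the starting point

# Full trajectory in time order: reversed past, then future (start point counted once).
path = np.concatenate([bwd_q[::-1], fwd_q[1:]])
```

Here `path` runs from the reactant side through the starting point to the product side, so this particular starting point lies on a trajectory that came from A and goes to C. In a real application the starting points would be drawn from the equilibrium distribution in the bottleneck rather than chosen by hand.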
The Question of Equilibrium in the Bottleneck. This question will be discussed at some length (see also Anderson, ref. 11), because it has been the source of much confusion in the past. Consider a closed system whose 6N-dimensional phase space contains two regions arbitrarily labelled 'reactant' and 'product', as well as a third 'bottleneck' region placed so as to intersect essentially all trajectories passing between the other two regions.
Figure 1. Regions of phase space: A (reactant), B (bottleneck), C (product).
Since A, B, and C are regions in the phase space of a single closed system, the transitions between A and C represent a unimolecular reaction or isomerization, rather than a general reaction in the sense of chemical kinetics. Unlike some unimolecular reactions (e.g. the decomposition of diatomic molecules), the molecular dynamics system of eq. 1 will be assumed to have sufficiently many well-coupled degrees of freedom that transitions between reactant and product regions occur spontaneously, without outside interference.
First let us assume that the system has been undisturbed for so long that it is in a macrostate of thermal equilibrium. Trajectories will then pass through the bottleneck region equally often from left to right and from right to left, and the probabilities of different microstates in the bottleneck region, as in any part of phase space, will be given by the formulas of equilibrium statistical mechanics (e.g. the equilibrium microcanonical density,

$$ P_{eq}(p,q) = \frac{\delta(H(p,q) - E)}{\int d\omega\, \delta(H(p,q) - E)}, \qquad (2) $$

for a system whose equations of motion conserve energy but not linear or angular momentum). In the denominator $d\omega$ represents the 6N dimensional volume element $(dp_1\, dp_2 \ldots dp_{3N}\, dq_1\, dq_2 \ldots dq_{3N})$.

The equilibrium distribution in the bottleneck region is a rigorous result for any system in macroscopic equilibrium and does not depend on how easy or difficult the bottleneck is to enter, or on how quickly the typical trajectory passes through. Nevertheless, it has seemed intuitively implausible to some solid state physicists (12), who have argued that the typical atom, in making a diffusive jump, usually approaches the saddle point so quickly that the neighboring atoms (between which the jumping atom must pass) do not have time to relax outward fully, as they would have, had the jumping atom been brought to the saddle point slowly and allowed to equilibrate there. The error here is in regarding the jumping atom's approach solely as a cause of the outward relaxation, when it may equally well be a result of that relaxation, inasmuch as prior outward relaxation of the neighbors makes it easier for the jumping atom to pass through. The jump event is more properly treated as a fluctuation in a many-body system at thermal equilibrium: the jumping atom's presence in the saddle point neither causes, nor results from, but rather is instantaneously correlated with, a relaxation in the mean positions of all other atoms in the system. Similar arguments imply that the velocity distribution of atoms found in the saddle point neighborhood is thermal and Maxwellian.
Although a jumping atom will usually need more-than-average kinetic energy to ascend to the saddle point, all this excess kinetic energy will, on the average, have been converted into potential energy during the ascent, only to be recovered as kinetic energy during the descent.
Now consider a nonequilibrium macrostate in which reactant and product zones are not in equilibrium with each other, but each by itself is in equilibrium. Strictly speaking this condition cannot maintain itself if there is any flux through the bottleneck: in the long run global equilibrium will of course be attained, while even in the short run the flux will cause departures from local equilibrium, selectively depleting some microstates in the reactant zone and enhancing some in the product zone. However, if both reactant and product zones have mean residence times much longer than their internal relaxation times, this selective depletion and enhancement will be negligible, and the approach to global equilibrium will take place without a significant deviation from local equilibrium. The local equilibrium or 'steady-state' approximation is justified whenever the so-called bottleneck really is a bottleneck between the two regions it connects, in the sense of being the chief obstacle to their rapid equilibration. If it is not, then the relaxation process being studied either lacks a clear-cut bottleneck, or else the bottleneck has been incorrectly identified and the true bottleneck lies within the reactant or product zone.

The lack of equilibrium between reactant and product zones leads to a distinctly nonequilibrium distribution in the bottleneck, but fortunately it is one that can be expressed easily (11) in terms of the equilibrium distribution and trajectory information.
To do this, the equilibrium probability density $P_{eq}(p,q)$ is split into two nonoverlapping parts, $P_a(p,q)$ and $P_c(p,q)$, the former originating from an equilibrium distribution in A, the latter from an equilibrium distribution in C. For each phase point (p,q): if the (unique) trajectory through (p,q) has been in A more recently than it has been in C, set $P_a(p,q) = P_{eq}(p,q)$ and set $P_c(p,q) = 0$. Conversely, if the trajectory through (p,q) has been in C more recently than in A, set $P_c(p,q) = P_{eq}(p,q)$ and set $P_a(p,q) = 0$. Since every phase point (except for uninteresting ones accessible from neither A nor C) satisfies one of the two trajectory conditions above and no phase point satisfies both, the two terms add up to the equilibrium density; on the other hand, each term separately represents the situation in which an equilibrium distribution of trajectories attacks the bottleneck from one side while no trajectories attack from the other side. The general intermediate case, where A and C are both populated and internally at equilibrium but out of equilibrium with each other, can be expressed by saying that if a nonequilibrium steady state's probability density is uniformly $X_a$ times the equilibrium value in A and uniformly $X_c$ times the equilibrium value in C, then the resulting density in the bottleneck region will be

$$ P_{neq}(p,q) = X_a P_a(p,q) + X_c P_c(p,q). \qquad (3) $$
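The splitting rule behind eq. 3 amounts to following each phase point backward in time until it is found in A or in C. A minimal sketch under illustrative assumptions: the one-dimensional double well, the region boundaries `a_edge` and `c_edge`, the time step, and the step limit are hypothetical, and mass-weighted coordinates are used so that momentum and velocity coincide.

```python
def grad_U(q):
    """Gradient of a hypothetical double well U(q) = (q^2 - 1)^2."""
    return 4.0 * q * (q * q - 1.0)

def last_visited(q, v, dt=0.005, max_steps=200_000, a_edge=-0.8, c_edge=0.8):
    """Follow the trajectory through (q, v) backward in time until it is in A or C.

    A is taken as q < a_edge and C as q > c_edge. Returns "A", "C", or None
    if neither region is reached within max_steps.
    """
    a = -grad_U(q)
    for _ in range(max_steps):
        if q < a_edge:
            return "A"
        if q > c_edge:
            return "C"
        # One velocity Verlet step with time step -dt.
        q = q - v * dt + 0.5 * a * dt * dt
        a_new = -grad_U(q)
        v = v - 0.5 * (a + a_new) * dt
        a = a_new
    return None

def density_ratio(q, v, x_a, x_c):
    """P_neq / P_eq at the phase point (q, v), following eq. 3."""
    source = last_visited(q, v)
    if source == "A":
        return x_a
    if source == "C":
        return x_c
    return 0.0   # accessible from neither A nor C

from_a = last_visited(0.0, +0.5)   # moving toward C, so it came from A
from_c = last_visited(0.0, -0.5)   # moving toward A, so it came from C
```

With `x_a = x_c = 1` the ratio is 1 for every point reachable from A or C, which recovers the equilibrium density as the text requires.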
Counting the Trajectories. The generation of trajectories and the estimation of the overall transition rate are facilitated by defining an arbitrary 6N-1 dimensional dividing surface S in the bottleneck region, and counting the trajectories as they cross through it.
Figure 2. Reactant region A and product region C.
The forward transition rate constant, i.e. the number of transitions from A to C per unit time and per unit probability in region A, can be expressed generally and rigorously (i.e. assuming only classical mechanics and local equilibrium in A) as

$$ W = \frac{\int_S d\sigma\, P_{eq}(p,q)\, u_\perp(p,q)\, (u_\perp > 0)\, \xi(p,q)}{\int_A d\omega\, P_{eq}(p,q)}. \qquad (4) $$
Here $P_{eq}$, the equilibrium probability density defined earlier, is integrated ($d\omega$) over the 6N dimensional reactant zone A to obtain the normalizing factor in the denominator. In the numerator, the same density is integrated ($d\sigma$) over the 6N-1 dimensional surface S, with various weight factors which, like $P_{eq}$, are functions of the coordinates q and momenta p. The factor $u_\perp(p,q)$ is the normal component of the velocity (in 6N space) of the unique trajectory that crosses the surface S at point (p,q). It is included because the crossing frequency through a surface is proportional to the product of local density and velocity; reverse crossings are excluded by the factor $(u_\perp > 0)$, which takes the value 1 or 0 according to the sign of $u_\perp(p,q)$. The integral of the first three factors alone thus represents the equilibrium forward crossing frequency through the dividing surface, and in early forms of transition state theory this was usually identified with the forward transition rate. In fact, because of multiple crossings, it is only an upper bound on the transition rate. Multiple crossing trajectories have been found to be significant in gas phase chemical reactions (13), and in vacancy diffusion in solids (14).

To correct for multiple crossings (and, incidentally, for nonequilibrium between reactant and product zones) Keck (10) and Anderson (11) introduced a third, trajectory-dependent factor $\xi(p,q)$ that causes each successful forward trajectory (i.e. originating in A and passing through the bottleneck to C) to be counted exactly once, no matter how many times it crosses S; and causes other trajectories (i.e. those that go from C to A, from A to A, or from C to C) not to be counted at all. Many different functions will achieve this purpose, for example Anderson's:

$$ \xi(p,q) = \begin{cases} 1 & \text{if the (unique) trajectory through } (p,q) \text{ crosses } S \text{ an odd number of times, of which } (p,q) \text{ is the last,} \\ 0 & \text{otherwise;} \end{cases} $$

or Keck's:

$$ \xi(p,q) = \begin{cases} 1/k & \text{if } (p,q) \text{ is one of the forward crossings on a trajectory with } k \text{ forward crossings and } k-1 \text{ backward crossings,} \\ 0 & \text{otherwise.} \end{cases} $$
In addition to correcting for multiple crossings, the factor $\xi$ corrects for nonequilibrium between reactant and product zones, because those parts of S not in equilibrium with A contribute only trajectories for which the product $(u_\perp > 0)\,\xi$ is zero.

It is clear for topological reasons that the same value of the transition rate will be obtained regardless of where the dividing surface is placed in B, provided it intersects all successful trajectories. Nevertheless, for the sake of better statistics, the dividing surface should be chosen so as to intersect as few unsuccessful trajectories as possible. Similarly, although the two $\xi$ functions have the same mean value, Keck's appears preferable because it has a smaller variance.

For use, eq. 4 may be rewritten in the form of two factors, which require somewhat different numerical techniques for their evaluation:

$$ W = \frac{\int_S d\sigma\, P_{eq}}{\int_A d\omega\, P_{eq}} \cdot \langle u_\perp\, (u_\perp > 0)\, \xi \rangle_s, \qquad (5) $$

where $\langle\,\rangle_s$ denotes averaging over an equilibrium ensemble on the surface S.

The first or 'probability factor' is essentially a ratio of partition functions, and represents the integrated equilibrium density of phase points on S per phase point in A. The second or 'trajectory-corrected frequency factor' is the number of successful forward trajectories per unit time and per unit equilibrium density on S. The ratio of this to the uncorrected frequency factor $\langle u_\perp (u_\perp > 0) \rangle_s$ represents the number of successful forward trajectories per forward crossing. Anderson called this ratio the 'conversion coefficient' to distinguish it from the 'transmission coefficient' of traditional rate theory (1), which was usually defined rather carelessly and given little attention, because it could not be computed without trajectory information.

Usually one deals with a system whose equations of motion are invariant under time reversal, and the definitions of the dividing surface and reactant and product regions involve only coordinates, not momenta. Under these conditions (which will henceforth be assumed) the factor $u_\perp (u_\perp > 0)$ in eqs. 4 and 5 can be replaced by $\tfrac{1}{2}|u_\perp|$, and the frequency factor (and conversion coefficient) will be the same in the forward and backward directions, because every successful forward trajectory is the reverse of an equiprobable successful backward trajectory. One can then use a third form of the $\xi$ function, viz.

$$ \xi(p,q) = \begin{cases} 1/k & \text{if } (p,q) \text{ is any crossing on a trajectory that makes an odd number, } k \text{, of crossings,} \\ 0 & \text{otherwise.} \end{cases} \qquad (6) $$
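The three weighting rules (Anderson's, Keck's, and eq. 6) can be compared on a single trajectory described only by the directions of its crossings of S. The sketch below is illustrative and makes simplifying assumptions: a trajectory is a hypothetical list of +1 (forward, A side to C side) and -1 (backward) entries in time order, and the velocity weighting by the normal velocity is left out so that only the counting logic is shown.

```python
def xi_anderson(dirs):
    """Weight 1 on the last crossing of a trajectory with an odd number of crossings."""
    n = len(dirs)
    return [1.0 if (n % 2 == 1 and i == n - 1) else 0.0 for i in range(n)]

def xi_keck(dirs):
    """Weight 1/k on each forward crossing when there are k forward and k-1 backward crossings."""
    n_fwd = sum(1 for d in dirs if d > 0)
    n_bwd = len(dirs) - n_fwd
    if n_fwd - n_bwd != 1:
        return [0.0] * len(dirs)
    return [1.0 / n_fwd if d > 0 else 0.0 for d in dirs]

def xi_symmetric(dirs):
    """Eq. 6: weight 1/k on every crossing of a trajectory with an odd number k of crossings."""
    n = len(dirs)
    return [1.0 / n if n % 2 == 1 else 0.0] * n

def forward_count(dirs, xi):
    """Sum of (u > 0) * xi over the crossings, as used with Anderson's or Keck's rule."""
    return sum(w for d, w in zip(dirs, xi(dirs)) if d > 0)

def symmetric_count(dirs):
    """Sum of (1/2) * xi over all crossings, as used with eq. 6."""
    return 0.5 * sum(xi_symmetric(dirs))

a_to_c = [+1, -1, +1]   # successful forward trajectory with one recrossing
a_to_a = [+1, -1]       # unsuccessful: returns to A
c_to_a = [-1, +1, -1]   # successful backward trajectory
```

Anderson's and Keck's rules each credit `a_to_c` with a total of 1 and the other two trajectories with 0. The eq. 6 rule credits 1/2 to `a_to_c` and 1/2 to `c_to_a`, which gives the same forward count once forward and backward trajectories are averaged as equiprobable time-reversed pairs.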
This function has the least variance of all $\xi$ functions, because it distributes each trajectory's weight equally among all its crossings.

When the activation energy is small compared to the total kinetic energy, as it is in most systems with >100 degrees of freedom, the difference between the microcanonical ensemble and the more convenient canonical ensemble can usually be neglected. In the canonical ensemble, the momentum integrals cancel out of eq. 4, making the probability factor a simple ratio of configurational integrals. Combining this with the time-reversal-invariant form of the frequency factor and the optimum $\xi$ function of eq. 6, we get

$$ W = \frac{Q^*}{Q_a}\, \langle \tfrac{1}{2}\, |u_\perp(p,q)|\, \xi(p,q) \rangle_s, \qquad (7) $$
where $Q_a$ and $Q^*$ are integrals of exp(-U(q)/kT) over, respectively, the 3N dimensional reactant region and the 3N-1 dimensional dividing surface in configuration space. This exact expression for the transition rate is the one that will be used most often in the remainder of this paper.

Definition of a Successful Transition. It is clear that the transition rate depends on the boundaries adopted for the bottleneck region B, which a trajectory must traverse to be counted as successful. If B is made very narrow, the transition rate will be overestimated, because dynamically-correlated multiple crossings will be counted as independent transitions; on the other hand, if B is enlarged to include all of configuration space, trajectories will never leave B and the transition rate will be zero. However, if the assumed bottleneck indeed represents the chief obstacle to rapid equilibration between two parts of configuration space, there will be a range of sizes over which the transition rate is nearly independent of the definition of B. These 'reasonable' definitions will make B small enough to exclude most of the equilibrium probability, yet large enough so that a trajectory passing through B in either direction is unlikely to return through B immediately in the opposite direction.

The time of return can itself be made the criterion of success, by forgetting about the region B and counting two consecutive crossings of S as independent transitions if and only if they are separated by a time interval greater than some characteristic time $\tau_0$, e.g. the autocorrelation time of the velocity normal to the dividing surface. A successful transition, then, is a portion of trajectory that crosses S an odd number of times, at intervals less than $\tau_0$, preceded and followed by crossing-free intervals of at least $\tau_0$. This criterion of success emphasizes the fact that unless the mean time between transitions is long compared to other relaxation times of the system, successive transitions will be correlated, and the transition rate will be somewhat ill-defined. Such correlated transitions, representing a breakdown of the random walk hypothesis, are significant in solid state diffusion (14,15), at high defect jump rates. The correlations may be investigated either by simulating the system directly, without bottleneck methods, or by continuing trajectories started in the bottleneck far enough forward and backward in time to include any other transitions correlated with the original one.

Sampling the Equilibrium Distribution in the Bottleneck. In order to generate representative trajectories and evaluate the corrected frequency factor, one needs a sample of the equilibrium distribution $P_{eq}(p,q)$ on the surface S, where the total equilibrium probability is very low.
For very simple systems (13) this sample can be generated analytically, but for anharmonic polyatomic systems it can only be obtained numerically, by doing a molecular dynamics or Monte Carlo machine experiment designed to sample the equilibrium distribution on S correctly, while greatly enhancing the system's probability of being on or near S. This may be accomplished by a Hamiltonian of the form

$$ H^*(p,q) = \begin{cases} H(p,q) & \text{if } q \text{ is within a small distance of } S, \\ +\infty & \text{otherwise.} \end{cases} \qquad (8) $$
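A Monte Carlo machine experiment under the constrained Hamiltonian of eq. 8 can be sketched as a Metropolis walk that rejects any move leaving a thin slab around S. This is a minimal illustration and not the paper's code: it assumes a canonical ensemble, a hypothetical two-dimensional potential `U`, a dividing surface S at x = 0, a slab half-width `delta`, and arbitrary step sizes.

```python
import numpy as np

def U(q):
    """Hypothetical 2-D surface: double well along x, harmonic along y."""
    x, y = q
    return (x * x - 1.0) ** 2 + y * y

def sample_near_surface(n_steps, kT=0.5, delta=0.05, seed=0):
    """Metropolis sampling of configurations with |x| < delta, i.e. near S (x = 0)."""
    rng = np.random.default_rng(seed)
    q = np.array([0.0, 0.0])
    step = np.array([delta, 0.7])          # trial move sizes in x and y
    samples = np.empty((n_steps, 2))
    for i in range(n_steps):
        trial = q + step * rng.uniform(-1.0, 1.0, size=2)
        # Outside the slab the constrained Hamiltonian is +infinity: always reject.
        if abs(trial[0]) < delta:
            dU = U(trial) - U(q)
            if dU <= 0.0 or rng.random() < np.exp(-dU / kT):
                q = trial
        samples[i] = q
    return samples

samples = sample_near_surface(20_000)
y_variance = samples[:, 1].var()
```

Every sampled configuration stays within `delta` of S, while the coordinate along the surface (y) keeps its unconstrained Boltzmann distribution, whose variance for this potential is kT/2. Rejecting moves at the walls plays the role that elastic reflection plays in a dynamical run.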
One can do dynamics under this Hamiltonian by making the trajectory undergo an elastic reflection whenever it strikes one of the infinite barriers (14). Under H*, the different parts of S would be visited with the same relative frequency as in an unconstrained equilibrium machine experiment, but with a much greater absolute frequency, thereby allowing a representative sample of, say, 100 points on S to be assembled in a reasonable amount of computer time.

If the equilibrium distribution is canonical, the momentum distribution will be Maxwellian and independent of coordinates; hence, representative points (p,q) can be generated by taking q from an equilibrium Monte Carlo run constrained to make moves on the 3N-1 dimensional dividing surface in configuration space, and supplying momenta from the appropriate multidimensional Maxwell distribution. Alternatively, the dividing surface may be sampled by an unconstrained Monte Carlo run that is encouraged to remain near S by adding to the potential a holding term that is constant on S but increases rapidly as q moves away from S.

A well-chosen dividing surface should satisfy these three criteria: 1) its conversion coefficient should not be too small, 2) its definition should be simple enough to be implemented as a constraint or holding term in a MC or MD run, and 3) the autocorrelation time of this run should not be too large. If the bottleneck technique is to represent any saving over straight simulation, the total machine time expended per statistically-independent successful transition (including time to generate a statistically-independent starting point on S, time to compute the trajectory through it, and overhead from unsuccessful trajectories) must be less than the mean time between spontaneous transitions in a straightforward non-bottleneck simulation. Ordinarily, if the bottleneck is a single compact region in configuration space, it will not be difficult to find a dividing surface that satisfies all three criteria.
On the other hand, if the bottleneck is broad and diffuse, containing many parallel independent channels, the only surfaces that satisfy the first criterion may be so complicated and hard to define that they fail the second and third. (In this connection it should be noted that the 'continental divide' or 'watershed' between two reservoirs, which might appear an ideal dividing surface because of its high conversion coefficient, is not usable in practice because it is defined by a nonlocal property of the potential energy surface.) The figure suggests a broad, diffuse bottleneck whose watershed (dotted line) is so broad and so contorted that no simple approximation to it can have a good conversion coefficient.
It is not known whether such pathological bottlenecks occur in practice.

One important kind of broad bottleneck, probably not pathological, is found in chemical reactions in liquid solutions, where most of the solvent molecules are geometrically remote from, and therefore only weakly coupled to, the atoms immediately involved in the transition. The remote atoms exert only a mild perturbing effect on the transition, and need not be in any one configuration for the transition to occur. In other words, if a number of trajectories for successful transitions were compared, all would pass through a single small bottleneck in the subspace of important nearby atoms, but the same trajectories, when projected onto the subspace of remote atoms, would not be concentrated in any one region. In the full configuration space, the bottleneck will therefore appear broad and diffuse in the directions of the weakly-coupled degrees of freedom.

The obvious approach to this problem is to look for a dividing surface in the subspace of strongly coupled 'participant' degrees of freedom, for which the bottleneck is well localized. In the directions of the weakly-coupled 'bystander' degrees of freedom, the watershed is broad and diffuse; but one can reasonably hope that precisely because of this weak coupling it is not highly contorted in these directions, and that therefore the surface S will be a good approximation to it. Of course it may not always be easy to identify the participants and bystanders correctly.
The problem of separating the participants from the bystanders has come up in attempts to simulate dissociation of a pair of oppositely charged ions in water (16). If the dividing surface is taken to be a surface of constant distance between the two ions, the trajectory typically recrosses this surface many times without making noticeable progress toward dissociation or association. This appears to be because of a constraining cage of water molecules around the ions, which must rearrange itself before the ions can associate or dissociate. Nevertheless, spontaneous dissociations occasionally occur rather quickly. This suggests that if the dividing surface were made to depend in the proper way on the shape of the cage, transitions through it would be much less indecisive. It is not known how many water molecules must be treated as participants to achieve this result.
Calculating the Probability Factor. The transitions generated by continuing trajectories forward and backward in time from starting points on S will be representative of spontaneous transitions through the bottleneck, but the absolute transition rate will not yet be known, because the first factor of eqs. 5 and 7 is not known, and cannot be computed from information collected in the bottleneck region alone. This factor is the Boltzmann exponential of the free energy difference (or for a microcanonical ensemble, entropy difference) between a system constrained to the reactant region and a system constrained to the neighborhood of the dividing surface. For very simple or harmonic systems the free energy difference can be calculated analytically, but in general, it can only be found by special Monte Carlo or molecular dynamics methods. These methods resemble the calorimetric methods by which free energy differences are determined in the laboratory, in that they depend on measuring the work necessary to conduct the system along a reversible path between the two macrostates, or between each of them and some reference macrostate of known free energy. Laboratory calorimetry measures free energy as a function of independent state variables like temperature. Machine experiments are less limited: they can measure the free energy change attending the introduction of an arbitrary constraint or perturbing term in the Hamiltonian.

In the present case, for example, one could measure the reversible work required to squeeze the system from the reactant zone into the neighborhood of S by integrating the pressure of collisions against one of the constraining barriers of H*, as it is moved slowly
Alternatively, one could measure the reversible work along a path between the bottleneck and one reference system (e.g. a quadratic saddle point), and along another path between the reactant zone and a second reference system (e.g. a quadratic minimum), and subtract these. Computer calorimetry is easiest to perform in the canonical ensemble, where any derivative of the free energy is equal to the canonical average of the same derivative of the Hamiltonian, measurable in principle by a Monte Carlo run:

$$\frac{\partial (A/kT)}{\partial \lambda} = \left\langle \frac{\partial (H/kT)}{\partial \lambda} \right\rangle \qquad (9)$$

Here $A$ is the Helmholtz free energy, $\lambda$ is an arbitrary parameter of the Hamiltonian, and $\langle\ \rangle$ denotes a canonical average. For more information about 'computer calorimetry' see refs. 17, 18, and 19.
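The relation between a free energy derivative and a canonical average can be tried out on a toy problem. The following Python sketch is only an illustration under stated assumptions: the one-dimensional harmonic potential, the linear switching of the force constant with a coupling parameter, the Metropolis step size, and all function names are hypothetical choices, not taken from refs. 17-19. It integrates the Monte Carlo average of the derivative of the potential over the coupling parameter and compares the result with the classical analytic value (kT/2) ln(k1/k0).

```python
import math
import random


def canonical_average(k, kT, observable, n_steps=20000, step=1.5, seed=0):
    """Metropolis estimate of <observable(x)> for U(x) = 0.5 * k * x**2."""
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(n_steps):
        trial = x + rng.uniform(-step, step)
        delta_u = 0.5 * k * (trial * trial - x * x)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta_u <= 0.0 or rng.random() < math.exp(-delta_u / kT):
            x = trial
        total += observable(x)
    return total / n_steps


def free_energy_difference(k0, k1, kT=1.0, n_lambda=11):
    """Integrate <dU/dlambda> over lambda, with k(lambda) = k0 + lambda*(k1 - k0)."""
    dk = k1 - k0
    lambdas = [i / (n_lambda - 1) for i in range(n_lambda)]
    means = [
        canonical_average(k0 + lam * dk, kT, lambda x: 0.5 * dk * x * x, seed=i)
        for i, lam in enumerate(lambdas)
    ]
    h = lambdas[1] - lambdas[0]
    # Trapezoidal rule over the lambda grid.
    return h * (sum(means) - 0.5 * (means[0] + means[-1]))


if __name__ == "__main__":
    estimate = free_energy_difference(1.0, 4.0)
    exact = 0.5 * math.log(4.0)  # (kT/2) ln(k1/k0) with kT = 1
    print("estimate:", estimate, "analytic:", exact)
```

The estimate carries both statistical noise and a small trapezoidal-rule bias, so agreement with the analytic value should be expected only to within a few percent for these run lengths.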
Relation of Exact TST to the Harmonic Approximation. In the canonical ensemble, the most familiar TST expression for the rate constant is probably

$$W = \kappa \, \frac{kT}{h} \, \frac{Z^{\ddagger}}{Z_A} , \qquad (10)$$

where $Z_A$ and $Z^{\ddagger}$ are dimensionless quantum or classical partition functions of the A-constrained and S-constrained systems, calculated with respect to the same energy origin, and $\kappa$ is a transmission coefficient. This equation is exact and equivalent to eq. 7 if the partition functions are computed classically, and if $\kappa$ is taken to be the conversion coefficient,

$$\kappa = \frac{\langle\, |u_\perp(p,q)| \, \xi(p,q) \,\rangle_S}{\langle\, |u_\perp(p,q)| \,\rangle_S} , \qquad (11)$$
but, as will be seen below, it is not a good quantum mechanical formula. Eq. 10 is most frequently used in the harmonic approximation, with the dividing surface S being defined as the hyperplane perpendicular (in mass-weighted configuration space) to the unstable normal mode at the saddle point. This choice makes the conversion coefficient equal to unity because (in the harmonic approximation) all normal modes move independently; therefore a trajectory that crosses this hyperplane with positive velocity in the unstable mode cannot be driven back by any excitation of the other modes. The partition functions $Z^{\ddagger}$ and $Z_A$ are also easily evaluated in the harmonic approximation from products of the stable normal mode frequencies at the saddle point and minimum, respectively. One thus obtains a formula expressing the transition rate in terms of local properties at two special points of the potential energy surface, the minimum of the reactant zone and the saddle point in the bottleneck:

$$W = \frac{\prod_{i=1}^{3N} \nu_i^{\min}}{\prod_{i=1}^{3N-1} \nu_i^{sp}} \, \exp\left( -\frac{U_{sp}-U_{\min}}{kT} \right) , \qquad (12)$$

where $\nu_i^{\min}$ and $\nu_i^{sp}$ denote the stable normal mode frequencies, and $U_{\min}$ and $U_{sp}$ denote the potential energy, at the minimum and saddle point, respectively. The system is assumed to have no translational or rotational degrees of freedom.

This traditional, and still very useful, form of transition state theory is valid whenever quantum effects are negligible and the potential energy surface is quadratic for a vertical distance of several kT above and below the saddle point and minimum. Aside from assuring the accuracy of the harmonic partition functions, the latter condition justifies setting $\kappa = 1$ by assuring that trajectories crossing the saddle point hyperplane will not be reflected back until they have fallen several kT below the saddle point energy. In practice, although it is hard to prove (20), this makes multiple crossings very unlikely (21).

Much of the power of eq. 12 comes from the existence of powerful, locally-convergent methods for finding energy minima and saddle points, and methods for evaluating products of normal mode frequencies. Depending on the number of degrees of freedom, variable metric (22) minimizers like Harwell Subroutine VA13A or conjugate-gradient (23) minimizers like VA14A converge to the local energy minimum much faster than the obvious method of damped molecular dynamics. Saddle points can be found (24) similarly by minimizing the squared gradient $|\nabla U|^2$ of the energy (the starting point for this minimization must be fairly close to the saddle point, otherwise it will converge to some other local minimum of $|\nabla U|^2$, such as an energy minimum or maximum). Once the saddle point has been found, existing routines, taking advantage of the sparseness of $\nabla\nabla U$ for large n, are sufficient to extract the unstable mode at the saddle point and compute the product of stable mode frequencies (essentially the determinant of $\nabla\nabla U$) even for systems with several hundred atoms.

Even when the harmonic approximation is not quantitatively justified it provides a convenient starting point for exact treatments. Thus, even if the potential energy surface is anharmonic in the bottleneck, it is often smooth enough for there to be a principal saddle point that can be found by minimizing $|\nabla U|^2$. The harmonic hyperplane through this saddle point often makes a good dividing surface, through which most crossings lead to success. Similarly, the harmonic configurational integral on the hyperplane is a good starting point for a calorimetric Monte Carlo determination of the exact configurational integral on the same hyperplane. It may be necessary to restrict the hyperplane laterally, to avoid irrelevant portions of it that may extend beyond the bottleneck region.
[Figure 5: the harmonic hyperplane]
(The single-occupancy constraints mentioned on page 90 of ref. 14 are an example of such lateral restriction.) In systems whose bottlenecks are diffuse because of weakly-coupled 'bystander' degrees of freedom, it may be useful to look for a saddle point and harmonic hyperplane in the subspace of strongly coupled 'participant' degrees of freedom, e.g. by minimizing $|\nabla U|^2$ with respect to the participants while the bystanders are held fixed in some typical equilibrium positions. In general, minimum and saddle-point seeking routines will be useful whenever the potential energy surface (or its intersection with the subspace of participants) is smooth, i.e. free of numerous small wrinkles and bumps of height kT or less. When such roughness is absent, the typical bottleneck will not contain many saddle points.
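The workflow of locating stationary points by driving the squared gradient to zero and then evaluating the harmonic rate of eq. 12 can be sketched on a small model. Everything in the following Python example is an assumption made for illustration: the two-dimensional potential, unit masses, and the function names are hypothetical, and the search uses Newton steps on the gradient with backtracking on the squared gradient (a Gauss-Newton style reduction), which is not claimed to be the procedure of ref. 24 or of the Harwell routines. As in the text, the search converges to whichever stationary point is nearest the starting guess.

```python
import math


def energy(q):
    """Hypothetical 2-D surface: minima at (+-1, 0.3), saddle point at (0, 0)."""
    x, y = q
    s = y - 0.3 * x * x
    return (x * x - 1.0) ** 2 + 2.0 * s * s


def gradient(q):
    x, y = q
    s = y - 0.3 * x * x
    return (4.0 * x * (x * x - 1.0) - 2.4 * x * s, 4.0 * s)


def hessian(q):
    """Analytic second derivatives, returned as (Uxx, Uxy, Uyy)."""
    x, y = q
    s = y - 0.3 * x * x
    uxx = 4.0 * (3.0 * x * x - 1.0) - 2.4 * s + 1.44 * x * x
    return (uxx, -2.4 * x, 4.0)


def find_stationary_point(q, tol=1e-20, max_iter=100):
    """Reduce |grad U|^2 using Newton steps on grad U = 0 with backtracking."""
    q = tuple(q)
    for _ in range(max_iter):
        gx, gy = gradient(q)
        g2 = gx * gx + gy * gy
        if g2 < tol:
            break
        uxx, uxy, uyy = hessian(q)
        det = uxx * uyy - uxy * uxy
        if det == 0.0:
            break
        # Newton direction p solves H p = -g; it is a descent direction for |g|^2.
        px = -(uyy * gx - uxy * gy) / det
        py = -(uxx * gy - uxy * gx) / det
        step = 1.0
        while step > 1e-8:
            trial = (q[0] + step * px, q[1] + step * py)
            tx, ty = gradient(trial)
            if tx * tx + ty * ty < g2:
                q = trial
                break
            step *= 0.5
        else:
            break  # no decrease found; stop at the current point
    return q


def eigenvalues(h):
    """Eigenvalues of a symmetric 2x2 matrix given as (a, b, c) = [[a, b], [b, c]]."""
    uxx, uxy, uyy = h
    mean = 0.5 * (uxx + uyy)
    radius = math.hypot(0.5 * (uxx - uyy), uxy)
    return (mean - radius, mean + radius)


def harmonic_rate(q_min, q_sp, kT, mass=1.0):
    """Eq. 12: frequency-product ratio times the Boltzmann factor of the barrier."""
    def freq(lam):
        return math.sqrt(lam / mass) / (2.0 * math.pi)

    nu_min = [freq(lam) for lam in eigenvalues(hessian(q_min))]
    # Only the stable (positive-curvature) modes at the saddle point enter.
    nu_sp = [freq(lam) for lam in eigenvalues(hessian(q_sp)) if lam > 0.0]
    barrier = energy(q_sp) - energy(q_min)
    return math.prod(nu_min) / math.prod(nu_sp) * math.exp(-barrier / kT)


if __name__ == "__main__":
    q_sp = find_stationary_point((0.3, 0.2))   # guess near the saddle point
    q_min = find_stationary_point((0.9, 0.2))  # guess near a minimum
    print(q_sp, q_min, harmonic_rate(q_min, q_sp, kT=0.2))
```

For a system of realistic size the 2x2 algebra would be replaced by sparse linear algebra on the full second-derivative matrix, as the text indicates.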
Quantum Corrections. The obvious way to introduce quantum corrections in eq. 10 would be to interpret $Z_A$ and $Z^{\ddagger}$ as quantum partition functions; however, this neglects tunneling ($Z^{\ddagger}$, being the partition function of a system constrained to the top of the activation barrier, knows nothing about the barrier's thickness). In the harmonic approximation tunneling can be included as a 1-dimensional parabolic barrier correction, which has the same magnitude (but opposite sign) as the lowest-order quantum correction to the partition function of a parabolic well of the same curvature (25, 26). This means that, in the harmonic approximation and to lowest order in h, the classical transition rate is multiplied by a factor depending only on the sums of squares of the normal mode frequencies at the saddle point and minimum:

$$W_{quantum} = \left[ 1 + \frac{1}{24} \left( \frac{h}{kT} \right)^{2} \left( \sum_{i=1}^{3N} \left(\nu_i^{\min}\right)^2 - \sum_{i=1}^{3N} \left(\nu_i^{sp}\right)^2 \right) \right] W_{class} . \qquad (13)$$
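The lowest-order correction factor described above (one plus (1/24)(h/kT)^2 times the difference between the sums of squared frequencies at the minimum and at the saddle point) is simple to evaluate once the frequencies are known. The Python sketch below is a minimal illustration; the function name, the argument layout, and the example frequencies are hypothetical. The unstable mode is passed as the magnitude of its imaginary frequency, so its square is subtracted from the saddle-point sum.

```python
H_PLANCK = 6.62607015e-34   # Planck constant, J s
K_BOLTZMANN = 1.380649e-23  # Boltzmann constant, J / K


def quantum_correction_factor(nu_min, nu_sp_stable, nu_sp_unstable, temperature):
    """Lowest-order quantum factor multiplying the classical harmonic rate.

    Frequencies are in Hz and temperature in K. nu_sp_unstable is the
    magnitude of the imaginary frequency of the unstable saddle-point mode,
    so its square enters the saddle-point sum with a minus sign.
    """
    sum_min = sum(nu * nu for nu in nu_min)
    sum_sp = sum(nu * nu for nu in nu_sp_stable) - nu_sp_unstable ** 2
    h_over_kt = H_PLANCK / (K_BOLTZMANN * temperature)
    return 1.0 + h_over_kt ** 2 * (sum_min - sum_sp) / 24.0


if __name__ == "__main__":
    # Hypothetical frequencies for a two-mode system at 300 K.
    factor = quantum_correction_factor([3e13, 1e13], [3e13], 1e13, 300.0)
    print("W_quantum / W_class =", factor)
```

A factor far from unity signals that the lowest-order expansion is no longer trustworthy and that a more elaborate treatment is called for.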
The unstable mode at the saddle point has an imaginary frequency, and contributes negatively to the second sum, raising the transition rate. When this correction is applied to eq. 12, one still has an expression for the rate in terms of purely local properties at the saddle point and minimum. The size of this rather readily-calculated lowest-order correction can serve as a guide to whether more sophisticated quantum corrections are necessary.

The conditions for validity of the harmonic approximation in eq. 13 (i.e. that the potential be quadratic within a few de Broglie wavelengths $h/\sqrt{2\pi m kT}$ in all directions from the saddle point) are somewhat opposed to its conditions of validity in eq. 12 (i.e. that the potential be quadratic within a few kT above and below the saddle point), and for some chemical reactions, particularly those involving hydrogen, the harmonic approximation is not justified quantum mechanically in the temperature range of interest (27) even though it would be classically (21). For these reactions, more sophisticated 1-dimensional tunneling corrections to eq. 10 usually also fail, and it becomes necessary to use a method that does not assume separability of the potential in the saddle point neighborhood. Such a method has recently been developed by Miller et al. (28). It uses short lengths of classical trajectory, calculated on an upside-down potential energy surface, to obtain a nonlocal correction to the classical (canonical) equilibrium probability density $P_{eq}(p,q)$ at each point; then uses this corrected density to evaluate the rate constant via eq. 4. The method appears to handle the anharmonic tunneling in the reactions H+HH and D+HH fairly well (28), and can be applied economically to systems with arbitrarily many degrees of freedom.

Another quantum problem, the wide spacing of vibrational energy levels compared to kT, has caused trouble in applying bottleneck methods to simple gas phase reactions (29), making them sometimes less accurate than 'quasiclassical' trajectory calculations in which trajectories are begun in the reactant zone with quantized vibrational energies. This problem should be much less severe in polyatomic systems, because of the closer spacing of energy levels.

Systems with Many Bottlenecks.
So far we have considered a system with two reservoirs separated by one bottleneck; in general a polyatomic system will have many reservoirs in its configuration space, and the location of the critical bottleneck or bottlenecks will be unknown. Here we will first distinguish critical and rate-limiting bottlenecks from less important ones, and then discuss several more or less heuristic methods for finding bottlenecks.

Definition of Critical and Rate-Limiting Bottlenecks. The hypothesis of local equilibrium within the reservoirs means that the set of transitions from reservoir to reservoir can be described as a Markov process without memory, with the transition probabilities given by eq. 4. Assuming the canonical ensemble and microscopic reversibility, the rate constant $W_{ji}$ for transitions from reservoir i to reservoir j can be written

$$W_{ji} = \exp\left( -\frac{B_{ij} - A_i}{kT} \right) , \qquad (14)$$

where

$$A_i = -kT \ln Q_i \qquad (15)$$

is the free energy of reservoir i, and

$$B_{ij} = B_{ji} = -kT \ln \left( Q^{\ddagger} \, \langle\, |u_\perp(p,q)| \, \xi(p,q) \,\rangle_S \right) \qquad (16)$$

is a symmetric 'free energy' of the bottleneck, with $Q^{\ddagger}$ and $\langle\ \rangle_S$ being the configurational integral and equilibrium expectation on the dividing surface between reservoirs i and j (equations 15 and 16 fix the origin of the free energy scale by defining the A's and B's microscopically in terms of configurational integrals; however a consistent set of A's and B's could be defined macroscopically from the $W_{ij}$, by arbitrarily setting one of the A's to zero and solving eq. 14 recursively for the B's and the other A's). The system of reservoirs and bottlenecks can be represented on an 'activation energy diagram' with valley-heights given by the A's and peak-heights given by the B's.
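The bookkeeping implied by eq. 14 can be made concrete with a short Python sketch. The reservoir labels, the numerical free energies, and the function names are hypothetical; the only content taken from the text is that the rate constant for a transition out of reservoir i through a bottleneck is the exponential of minus (B - A_i)/kT, and that the symmetry of the B's makes the long-time population ratio depend on the A's alone.

```python
import math


def rate_constants(A, B, kT=1.0):
    """Return W[(j, i)], the rate constant for transitions i -> j (eq. 14).

    A maps reservoir -> free energy; B maps an unordered pair (i, j),
    stored once, to the symmetric bottleneck free energy.
    """
    W = {}
    for (i, j), b_ij in B.items():
        W[(j, i)] = math.exp(-(b_ij - A[i]) / kT)  # transitions i -> j
        W[(i, j)] = math.exp(-(b_ij - A[j]) / kT)  # transitions j -> i
    return W


def equilibrium_ratio(W, i, j):
    """Long-time occupation ratio P_i / P_j = W_ij / W_ji."""
    return W[(i, j)] / W[(j, i)]


if __name__ == "__main__":
    A = {"a": 0.0, "b": 2.0}   # hypothetical valley heights, in units of kT
    B = {("a", "b"): 6.0}      # hypothetical peak height, in units of kT
    W = rate_constants(A, B)
    print(equilibrium_ratio(W, "a", "b"), math.exp(A["b"] - A["a"]))
```

The bottleneck height cancels from the ratio, which is the practical meaning of the symmetry of the B's.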
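[Figure: activation energy diagram, with valley heights given by the A's and peak heights given by the B's]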
Microscopic reversibility of the equations of motion is important: without it the $B_{ij}$ would not be symmetric, and the relative occupation probabilities of reservoirs i and j in the long-time limit, here given by $P_i/P_j = W_{ij}/W_{ji} = \exp((A_j - A_i)/kT)$, could no longer be expressed in terms of local properties of the two reservoirs alone, but would depend on all paths connecting them.

The abscissa in activation energy diagrams is always somewhat arbitrary; the ordinate, although it cannot be assigned a definite meaning in the general chemical-kinetic situation of coupled reactions of differing order (30), has the exact meaning given in eq. 14 when the transitions are defined, as they are here, by a microscopically reversible set of first order rate constants. If there are many interconnecting reservoirs, the peak and valley representation becomes inconvenient, and the system is better represented as an undirected graph whose vertices are the reservoirs and whose edges are the bottlenecks.

Since free energies tend to be large compared to kT, it is reasonable to assume that no two reservoirs have the same free energy to within kT, and that no two bottlenecks do either. Under these conditions, exit from any reservoir is overwhelmingly likely to occur through the lowest bottleneck leading out, and given any two reservoirs x and y, there is a well-defined set of reservoirs and bottlenecks which the system will probably visit on its way from x to y. This set consists simply of all the places that would get wet if water were poured into x until it began running into y (cf. fig. 7).
[Figure 7: activation energy diagram, vertical scale 0 to 10, showing the two parallel paths x-k-l-y and x-j-i-y]
In fig. 7, the wet set consists of all the reservoirs except i, and all the bottlenecks except ji and iy. The reservoir y is shown twice to avoid having to superimpose visually the two parallel paths (x-k-l-y) and (x-j-i-y) that lead from x to y.

The hydrological construction leads to a descending sequence of lakes, each comprising a set of elementary reservoirs that reach a common local equilibrium and look from the outside like a single reservoir. The mean residence time for a lake is the positive exponential of its depth (the depth of a compound lake is simply the depth of its deepest part, compared to which all other parts are negligible, because of the rule that the A's typically differ by more than kT). Of the bottlenecks that are visited, the submerged ones like xj and kl are typically visited many times, and have hardly any influence on the mean time required to get from x to y. The critical bottlenecks are ones like xk and ly that stand at the spillways of lakes. The system typically passes through each critical bottleneck exactly once. One of the critical bottlenecks, the one with the deepest lake behind it, is rate-limiting: most of the time is spent waiting in that lake. (It is sometimes wrongly supposed that the highest bottleneck, here xk, is rate-limiting; in fact bottleneck ly is, because its lake is deeper than that behind xk. The highest bottleneck is thus 'path-determining' without necessarily being rate-limiting. The complicated relation among rates and bottlenecks is shown by the fact that if bottleneck xk were raised, so that the jx lake overflowed to the left instead of to the right, the mean time to pass from x to y would actually be decreased, because the deepest lake would have a depth of four instead of five.)
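The hydrological construction can be phrased as a small graph algorithm. The Python sketch below is one possible reading of it, under stated assumptions: the flooding is implemented by always crossing the lowest bottleneck leading out of the set of reservoirs already wetted (a Prim-style growth using a heap), a crossed bottleneck is classed as critical when it is higher than every bottleneck crossed after it, and the numerical free energies are invented for illustration (they are not read off fig. 7). With these hypothetical heights the highest bottleneck, xk, is path-determining but the lower bottleneck ly is rate-limiting.

```python
import heapq


def flood(B, start, goal):
    """Pour water into `start` until it reaches `goal`.

    B maps an unordered pair (i, j) to a bottleneck height. Returns the
    reservoirs wetted, in order, and the bottlenecks crossed as
    (height, from_reservoir, to_reservoir) tuples.
    """
    neighbors = {}
    for (i, j), b in B.items():
        neighbors.setdefault(i, []).append((b, i, j))
        neighbors.setdefault(j, []).append((b, j, i))
    wet, seen, crossings = [start], {start}, []
    frontier = list(neighbors.get(start, []))
    heapq.heapify(frontier)
    while goal not in seen and frontier:
        b, i, j = heapq.heappop(frontier)  # lowest bottleneck leading out
        if j in seen:
            continue
        seen.add(j)
        wet.append(j)
        crossings.append((b, i, j))
        for edge in neighbors[j]:
            if edge[2] not in seen:
                heapq.heappush(frontier, edge)
    return wet, crossings


def lakes(A, start, crossings):
    """Group wetted reservoirs into lakes and report each spillway and depth."""
    result, members = [], [start]
    for n, (b, i, j) in enumerate(crossings):
        # A spillway is higher than everything crossed afterwards; lower
        # bottlenecks end up submerged inside a lake.
        if all(b > later[0] for later in crossings[n + 1:]):
            depth = b - min(A[m] for m in members)
            result.append({"spillway": (i, j), "lake": members, "depth": depth})
            members = [j]
        else:
            members.append(j)
    return result


if __name__ == "__main__":
    # Hypothetical free energies in units of kT.
    A = {"x": 5.0, "j": 7.0, "i": 4.0, "k": 3.0, "l": 0.0, "y": -2.0}
    B = {("x", "j"): 8.0, ("j", "i"): 11.0, ("i", "y"): 9.0,
         ("x", "k"): 10.0, ("k", "l"): 4.0, ("l", "y"): 6.0}
    wet, crossings = flood(B, "x", "y")
    found = lakes(A, "x", crossings)
    limiting = max(found, key=lambda lake: lake["depth"])
    print("wet reservoirs:", wet)
    print("critical bottlenecks:", [lake["spillway"] for lake in found])
    print("rate-limiting:", limiting["spillway"], "depth", limiting["depth"])
```

The mean residence time in each lake would then scale as the exponential of its depth divided by kT, so the deepest lake dominates the passage time.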
Finding the Bottlenecks. In order to carry out the hydrological construction, one must be able, given a reservoir i and a list of the lowest bottlenecks leading out of it, to find the next lowest, its transition rate $W_{ji}$ and the new reservoir j that it leads to. A straightforward MD or MC simulation would eventually find all the relevant bottlenecks and reservoirs, but only at the cost of waiting thousands of years in the deep lakes, which is precisely what we are trying to avoid. There are several ways of getting a polyatomic system to escape from one local minimum or reservoir into another; but unfortunately none of them can be trusted to escape via the bottleneck of lowest free energy, as is required for the hydrological construction. Therefore they must be used rather conservatively, in an attempt to gradually discover and fill out the unknown graph of reservoirs and transition rates, without missing any important bottleneck.

Escape methods are most powerful when used in connection with static energy minimization and saddle-point finding routines, in an effort to catalogue all the relevant saddle points and minima on the potential energy surface. This approach should be used whenever the potential energy surface is smooth on a scale of kT, so that the typical barrier height between adjacent local minima is high enough to justify treating each local minimum as a separate reservoir and each saddle point as a separate bottleneck. The main static methods of escape are 1) systematic search, 2) intuition, 3) normal mode thermalization, and 4) pushing.
1) Systematic search of the neighborhood. This is practical only if the search is conducted in a subspace of low dimensionality, because the number of mesh points required grows exponentially with the dimensionality. It is usually advisable, at each mesh point, to locally minimize the energy with respect to all the degrees of freedom not being searched. This is called 'adiabatic mapping'. A systematic search with sufficiently fine mesh in all relevant degrees of freedom will indeed locate all saddle points and minima in a given neighborhood, but it is usually prohibitively expensive. Methods that do not search everywhere are in principle unreliable because it is possible for a saddle point or minimum on the potential energy surface to be so sharply localized that it is undetectable a short distance away. (This may seem to contradict the notion that since every saddle point has a unique 1-dimensional gully or steepest-descent path connecting it to each of two minima, it ought to be possible to follow the gullies from minimum to saddle point to minimum all over the potential energy surface. Unfortunately, neither these gullies nor the 3N-1 dimensional watersheds between adjacent minima are locally-definable properties of the potential energy surface.)

2) Intuition: considerations of symmetry and common sense (aided perhaps by model building) often make the approximate locations of the relevant minima and saddle points obvious.

3) Levitt and Warshel (31, 32) have used a method called 'normal mode thermalization' to simulate statically the effect of heating to a temperature above the barrier height between adjacent minima. Starting at one local minimum, the system is displaced along each normal mode by an amount that would correspond to kT energy rise on the local quadratic approximation to the potential energy surface; however, the 'temperature' used is so high that on the real potential energy surface the system is displaced out of its original watershed, and subsequent energy minimization leads to a new local minimum, from which the whole process can be repeated. Like explicit heating, this method preferentially displaces the system in the easy directions, i.e. along the softer normal modes, which are less likely to produce immediate atom-atom overlaps.

4) 'Pushing'. This consists of minimizing the energy of a system in which the original minimum has been destabilized by an artificial perturbing term in the potential energy.
Such pushing potentials have been used in energy minimization studies on proteins by Gibson and Scheraga (33) and by Levitt (32), and are quite similar in spirit to the methods used by Torrie and Valleau (19) to push Monte Carlo systems into desired regions of configuration space. In the case of energy minimization, the goal of the added term should be to make what was a local minimum flat, or slightly convex, thus causing the system to roll away to another minimum. The obvious term to do this is a paraboloidal mound complementary in shape to the harmonic neighborhood of the local minimum:

$$U'(q) = U(q) + U_{push}(q) , \qquad (17)$$

where

$$U_{push}(q) = -\tfrac{1}{2} \, (q - q_{\min}) \cdot \nabla\nabla U(q_{\min}) \cdot (q - q_{\min}) ,$$

with $q_{\min}$ denoting the coordinates of the minimum. One may also define a spherically symmetric pushing potential,

$$U_{push}(q) = -\mathrm{const} \cdot |q - q_{\min}| . \qquad (18)$$
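A one-dimensional toy problem shows how a pushing potential of the eq. 17 type lets a minimizer escape. The Python sketch below rests on several assumptions that are not in the text: the double-well potential, the plain steepest-descent minimizer, the small initial nudge used to move the system off the now-flat stationary point, and the final relaxation on the unperturbed potential are all illustrative choices with hypothetical names.

```python
def minimize(gradient, q, step=0.02, tol=1e-10, max_iter=100000):
    """Plain steepest descent in one dimension."""
    for _ in range(max_iter):
        g = gradient(q)
        if abs(g) < tol:
            break
        q -= step * g
    return q


def grad_u(q):
    """Gradient of the hypothetical double well U(q) = (q**2 - 1)**2."""
    return 4.0 * q * (q * q - 1.0)


def curvature_u(q):
    """Second derivative of the same double well."""
    return 12.0 * q * q - 4.0


def push_escape(q_min, nudge=-0.05):
    """Escape from q_min under U' = U + U_push, then relax on U alone."""
    k = curvature_u(q_min)

    def grad_pushed(q):
        # U_push = -0.5 * k * (q - q_min)**2 cancels the harmonic part of U
        # around q_min, leaving only the anharmonic terms to steer the system.
        return grad_u(q) - k * (q - q_min)

    q_pushed = minimize(grad_pushed, q_min + nudge)  # roll away under U'
    return minimize(grad_u, q_pushed)                # relax without the push


if __name__ == "__main__":
    print("new minimum:", push_escape(1.0))
```

Under the pushed potential the system first settles well beyond the second well, which illustrates why an untruncated pushing term can distort the new minimum and why the sketch finishes with a relaxation on the unperturbed potential.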
The potential of eq. 17 pushes the system away from the original minimum in the directions of negative deviation from harmonicity. The spherically symmetric potential of eq. 18 pushes the system away preferentially along the directions of low curvature. The pushing potentials used by Levitt (32) were of the symmetric type and incorporated a smooth cutoff at a range of several atomic diameters; this avoids having the pushing potential dominate the energy at large distance, seriously distorting any new minimum the system escapes into. More recently (34) Levitt has used unsymmetrical pushing potentials.

Since pushing potentials are not guaranteed always to escape via the lowest saddle point, it would be wise to use them systematically in an effort to find all the easy escapes from the given initial minimum. This can be done by repeating the escape minimization several times, each time adding to the potential a short-ranged repulsive term placed so as to obstruct the previous escape route.

Having escaped from one local minimum to an adjacent one, the next task is to find the saddle point, choose a good dividing surface and calculate the transition probabilities $W_{ij}$ and $W_{ji}$. If escape was achieved by pushing, the escape path typically passes through the bottleneck region, and the highest point (i.e. the point having highest unperturbed energy) on this path is often close enough to the saddle point to serve as a starting point for a locally convergent minimization of $|\nabla U|^2$, to find the saddle point. Once the saddle point has been found, the unstable mode and perpendicular hyperplane may be constructed in the usual manner.

If no escape path is available (e.g. if the second minimum were known a priori by reasons of symmetry or if it were found by a systematic search), an escape path can be generated by the 'push-pull' method. This is like pushing, except that it supplements the pushing potential in the minimum one wishes to leave with an attractive 'pulling' potential in the minimum one wishes to enter. The strengths and ranges of these potentials are gradually increased until the desired transition occurs. Tests of the push-pull method (35) on chirality reversal of a 10-atom model polymer showed it superior to the common method of one-dimensional constrained minimization, which did not come close enough to the saddle point to begin a convergent minimization of the squared gradient. The pitfalls of saddle-point finding methods based on constrained minimization have been noted by McIver and Komornicki (24) and Dewar and Kirschner (36).

When the potential energy surface is rough on the scale of kT, so that local minima are very numerous and separated by barriers of height kT or less, energy minimization methods are not very helpful, and it becomes necessary to use escape methods that will enable a finite-temperature MC or MD system to escape from a reservoir containing many local minima, through a bottleneck perhaps containing many saddle points. Aside from intuition, there are two basic methods: 1) heating, and 2) pushing.

1) Heating: a MC or MD system can always be induced to leave a restricted region in configuration space by raising its temperature or, equivalently, by arbitrarily making the atoms smaller or softer. Heating has the disadvantage of favoring escape via a wide bottleneck regardless of its height on the potential energy surface; this may not be the bottleneck having lowest free energy at the temperature of interest.

2) Pushing can best be applied to a MC or MD system if one has in mind a reaction coordinate, r(q), i.e. some function of the coordinates q that, because it takes on a rather limited range of values, suggests that the system is trapped in a rather limited part of configuration space. A Monte Carlo run under the unperturbed potential U would yield a fairly narrow distribution of values of r, representable as a histogram, h(r):
[Figure 8: histogram h(r) plotted against r]

Suppose one is interested (as Torrie and Valleau were) in the equilibrium probability of an r value, say r=30, outside the observed range; alternatively, one may suspect that p(r), the true equilibrium distribution of r, is bimodal, with another peak around r=40, but that a bottleneck around r=30 is preventing this peak from being populated. In order to push the local equilibrium ensemble out of the range r=15-25, it suffices to perform a Monte Carlo run under the potential

$$U' = U + U_{push} , \qquad (19)$$

with

$$U_{push}(q) = +kT \ln f(r(q)) ,$$
where f(r) is an always-positive function chosen to approximate the histogram h(r) in the range 15-25 where data have been collected and to be a reasonable extrapolation of h(r) in the region where data are desired but none have been collected. It is easy to show that the equilibrium distribution under the perturbed potential U' is related to that under U by

$$p'(r) = p(r) \, \frac{Q}{Q'} \, \frac{1}{f(r)} , \qquad (20)$$

where Q and Q' denote the two systems' configurational integrals. The histogram h' obtained under U' will thus be approximately flat where h was peaked, and will extend at least slightly into the range not visited by h.
[Figure 9: histogram h'(r) plotted against r, with r running from 10 to 50]
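The biased-sampling idea of eqs. 19 and 20 can be illustrated with a one-dimensional Metropolis run. In the Python sketch below the reaction coordinate is simply the coordinate itself, and the double-well potential, the choice of ln f, the bin layout, and the function names are all hypothetical. The run is performed under U' = U + kT ln f, and eq. 20 is then inverted (p proportional to h' times f) to recover the unperturbed distribution, including its value at the barrier top, which an unbiased run of the same length would sample poorly.

```python
import math
import random

KT = 1.0


def potential(q):
    """Hypothetical double well with a barrier of 5 kT at q = 0."""
    return 5.0 * (q * q - 1.0) ** 2


def log_f(r):
    """ln f(r): a deliberately imperfect guess at the shape of ln h(r)."""
    return -4.0 * (r * r - 1.0) ** 2


def biased_potential(q):
    """U' = U + kT ln f(r(q)), with r(q) = q in this toy model."""
    return potential(q) + KT * log_f(q)


def sample_histogram(u, n_steps=200000, step=0.5, lo=-2.0, hi=2.0,
                     n_bins=40, seed=1):
    """Metropolis sampling under potential u; returns (counts, lo, bin width)."""
    rng = random.Random(seed)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    q = 1.0
    energy = u(q)
    for _ in range(n_steps):
        trial = q + rng.uniform(-step, step)
        e_trial = u(trial)
        if e_trial <= energy or rng.random() < math.exp(-(e_trial - energy) / KT):
            q, energy = trial, e_trial
        b = math.floor((q - lo) / width)
        if 0 <= b < n_bins:
            counts[b] += 1
    return counts, lo, width


def unbias(counts, lo, width):
    """Invert eq. 20: p(r) is proportional to h'(r) * f(r); then normalize."""
    centers = [lo + (b + 0.5) * width for b in range(len(counts))]
    weights = [c * math.exp(log_f(r)) for c, r in zip(counts, centers)]
    total = sum(weights)
    return centers, [w / total for w in weights]


if __name__ == "__main__":
    counts, lo, width = sample_histogram(biased_potential)
    centers, p = unbias(counts, lo, width)
    # Free energy of the barrier-top bin relative to a bin in the well.
    estimate = -KT * math.log(p[20] / p[30])
    exact = potential(centers[20]) - potential(centers[30])
    print("estimated:", estimate, "exact:", exact)
```

Normalizing the reweighted histogram takes the place of the factor Q/Q' here; the more accurate estimators cited in the text are not reproduced.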
If there is a bottleneck at r=30, the system is much more likely to find it and suddenly leak through; if not, one has at least measured the equilibrium distribution of r in a region where it would be too low to measure directly. The normalizing factor Q/Q', necessary to make the connection between p and p', can be found from the histograms via eq. 20 or, more accurately, by eqs. 12a and 12b of reference 17.

If the system suddenly and irreversibly leaks into the region around r=40, indicating a bottleneck, the function f(r) should be revised to flatten out both peaks of the bimodal distribution, and produce an approximately uniform distribution over the whole range r=20 to 40. Sampling this flattened-out ensemble serves two purposes:

1) It allows a representative sample of configurations on the dividing surface to be collected in a reasonable amount of computer time (the dividing surface is conveniently defined by r(q) = rmin, where rmin is the minimum of the bimodal distribution p(r) that would obtain under the unperturbed potential U). From these, trajectories can be calculated in the usual way, to obtain the second factor of eq. 7.

2) By virtue of the known relation (eq. 20) between p and p', it establishes a calorimetric path connecting the reactant region with the bottleneck, allowing the first factor in eq. 7 to be calculated.

The definition of the reaction coordinate used in the MC pushing method may be derived from a separation into 'participant' and 'bystander' degrees of freedom, or it may be arrived at intuitively or empirically. Generally speaking, the more cleanly a reaction coordinate separates the two peaks of a bimodal distribution, the higher the conversion coefficient that can be achieved with it.

Speeding up the Sampling of Configuration Space. Bottleneck methods allow infrequent events to be simulated with very little explicit dynamical calculation, since the trajectory only needs to be followed forward and backward until it leaves the bottleneck. On the other hand, particularly for strongly anharmonic systems, they demand a great deal of MC or MD sampling of constrained or biased equilibrium ensembles, viz. the ensemble on the dividing surface, the ensemble in the reactant zone, and perhaps several calorimetric intermediates needed to compute the ratio of configurational integrals, $Q^{\ddagger}/Q_A$. It is important to be able to sample these ensembles efficiently, i.e. without expending too much computer time per statistically-independent sample point.
This section discusses several curable kinds of slowness commonly encountered in equilibrium sampling.

The simplest kind of slowness, and perhaps the most serious, is due to an unrecognized bottleneck within one of the equilibrium ensembles. If the unrecognized bottleneck is fairly easy to pass through, it will only increase the autocorrelation time of the run sampling the ensemble; if it is hard, it will lead to a completely erroneous sample. The cure is to find the bottleneck and treat it explicitly.

Another kind of slowness comes from the approximately 1000-fold disparity between bonded and nonbonded forces among atoms. This means that a typical covalent bond undergoes about 30 small-amplitude, nearly-harmonic vibrations in the time required for any other significant molecular motion to take place. In doing dynamics calculations, these fast vibrational modes are a nuisance because they force the use of a very short time step, about 0.001 psec or less. Fortunately, they can be gotten rid of in either of two ways: 1) they can be artificially slowed down (without affecting the equilibrium statistical properties of the system) by, in effect, giving them extra mass (37); 2) they can be frozen out entirely by incorporating constraints on bond distances and angles in the equations of motion. It was only recently recognized (38) that such constraints, even when applied to a large number of bonds simultaneously, need not appreciably increase the machine time required to do one integration step. Of course the mass-modified system does not have the same dynamics as the original system, and the rigid-bond system has neither the same dynamics nor the same statistical properties; however, accurate dynamics is needed only in the bottlenecks, and correct statistical properties are sufficient elsewhere. In view of the near-harmonicity of the bonded vibrations, it is probable that their effect on the statistical properties could be computed as a perturbation to the statistical properties of a rigid-bond system.

A third kind of slowness, that due to hydrodynamic modes, has been discussed already. It is difficult to do anything about these slow collective modes, but fortunately they cannot cost very many orders of magnitude in a system of a few thousand atoms or less.

A final kind of slowness is that which sometimes arises (39, 17) in Monte Carlo sampling under a biased potential of the form of eq. 19. Sometimes these runs exhibit discouragingly long autocorrelation times for diffusion of the reaction coordinate back and forth along its artificially broadened spectrum. The reason for this is not always clear, but sometimes it may be due to a strong gradient of energy and entropy parallel to the reaction coordinate, so that one end of the spectrum represents a small, low-energy region of configuration space while the other end represents a large region of uniform, moderately-high energy. Ordinary Monte Carlo transition algorithms (8), which make trial moves symmetrically in configuration space and then accept or reject them according to an energy criterion, cannot move very efficiently in such a gradient, because most trial moves are made in the direction of increasing entropy, only then to be rejected for raising the energy. This problem might be ameliorated by using an unsymmetrical Monte Carlo transition algorithm, one that made trial moves more often in directions suspected of leading toward the small, low-energy region, and compensated for this bias by giving a one-way energy reward to moves in the opposite direction.
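One way such an unsymmetrical transition rule could look is sketched below for a chain of integer states. This is an illustration under stated assumptions rather than a tested production scheme: the one-dimensional state space, the proposal probability p_down, and both function names are hypothetical. The "reward" appears as the factor q(reverse)/q(forward) in the acceptance ratio, which is what keeps the Boltzmann distribution stationary despite the lopsided trial moves.

```python
import math
import random

def biased_step(x, energy, kT, p_down=0.7, rng=random):
    """One Monte Carlo step on integer states with an asymmetric proposal.

    Trial moves go to x-1 with probability p_down and to x+1 otherwise.
    The acceptance ratio carries the factor q(reverse)/q(forward), which
    restores detailed balance with respect to exp(-E/kT).
    """
    if rng.random() < p_down:
        x_new, q_fwd, q_rev = x - 1, p_down, 1.0 - p_down
    else:
        x_new, q_fwd, q_rev = x + 1, 1.0 - p_down, p_down
    # Boltzmann factor times the proposal correction, in logarithmic form.
    log_ratio = -(energy(x_new) - energy(x)) / kT + math.log(q_rev / q_fwd)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return x_new
    return x

def transition_probability(x, x_new, energy, kT, p_down=0.7):
    """Exact probability of the move x -> x_new (neighbours only), for checking."""
    q_fwd = p_down if x_new == x - 1 else 1.0 - p_down
    q_rev = 1.0 - p_down if x_new == x - 1 else p_down
    ratio = math.exp(-(energy(x_new) - energy(x)) / kT) * q_rev / q_fwd
    return q_fwd * min(1.0, ratio)
```

The second function exists only so that detailed balance, exp(-E(x)/kT) T(x to x+1) = exp(-E(x+1)/kT) T(x+1 to x), can be verified numerically for any chosen energy function.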
Summary.

Some phenomena occurring in systems of 3 to 10,000 atoms are so infrequent that they would take thousands of years to simulate on a computer. Such long time phenomena (many orders of magnitude longer than the microscopic system's longest hydrodynamic relaxation time) involve a bottleneck or activation barrier, which, if it can be discovered, can be used to speed up the simulation by many orders of magnitude. The machinery for doing this consists of transition state theory supplemented by classical trajectory calculations to correct for multiple crossings and by 'calorimetric' Monte Carlo methods to evaluate analytically intractable partition functions.

Before the development of the digital computer, the main weakness of transition state theory was its dependence on the harmonic approximation; now its main weakness, and its main potential for future improvement, is in algorithms for finding bottlenecks.

When the energy surface is smooth on a scale of kT, bottlenecks can be identified with saddle points, and the need is for an algorithm that, given a potential minimum, will find all the reasonably low saddle points leading out of it. Existing algorithms are unreliable in principle (because a saddle point may be invisible a short distance away), but may be reliable in practice. More empirical testing of them is needed.

When the potential energy is rough on a scale of kT, saddle points (and their convenient unstable-mode hyperplanes) are no longer a good guide, and the job of selecting the reaction coordinate and dividing surface becomes much more arbitrary and empirical. An important and poorly-understood intermediate case is a potential energy surface that is smooth in some directions (the 'participant' degrees of freedom) and rough in other directions (the 'bystander' degrees of freedom). Table I outlines the steps for finding the bottleneck, evaluating the rate constant, and generating typical trajectories for infrequent events.
Table I. Flowchart for Bottleneck Simulation of Infrequent Events.

The method to use at each step depends on the smoothness of the potential energy surface U(q): harmonic (a), smooth (b), smooth only for participants (c), or rough for all degrees of freedom.

1. Characterize reactant zone A.
   Harmonic, smooth: local minimum of the potential energy U.
   Smooth only for participants, rough for all: equilibrium MC or MD run.

2. Escape from A and find bottleneck or saddle point.
   Harmonic, smooth: static push to escape; minimize |∇U|² from the maximum of U on the escape path to find the saddle point.
   Smooth only for participants: find the saddle point in the subspace of participants, with bystanders clamped.
   Rough for all: MC push using an empirical reaction coordinate r(q).

3. Choose dividing surface S.
   Harmonic, smooth: S = hyperplane ⊥ to the unstable normal mode.
   Smooth only for participants: define S as the ⊥ hyperplane in the subspace of participants, with bystanders clamped.
   Rough for all: define S by r(q) = rmin.

4. Sample S ensemble.
   Harmonic: not necessary.
   Otherwise: equilibrium MC run on or near the surface S.

5. Calc. Q‡/Qa.
   Harmonic: normal mode frequencies at saddle point and minimum.
   Otherwise: MC calorimetry between A and S ensembles.

6. Calc. the average over the S ensemble (second factor of eq. 7).
   MD forward and backward in time from (p,q) on S (representative trajectories through bottleneck).

7. Rate constant (d).

(a) U is quadratic within kT above and below local minima and saddle points.
(b) U is not quadratic, but is still smooth on a scale of kT, so that adjacent local minima are typically separated by barriers higher than kT.
(c) U is smooth with respect to some degrees of freedom, the 'participants', but rough on a scale of kT with respect to others, the 'bystanders'.
(d) Rate constant W is obtained by eq. 12 if U is harmonic, otherwise by eq. 7.
Acknowledgements

I wish to thank Phil Wolfe and Michael Levitt for valuable discussions of minimization and escape methods, and Aneesur Rahman for repeatedly drawing my attention to practical cases of intolerable slowness in molecular dynamics. Some of the work was done at the Centre Européen de Calcul Atomique et Moléculaire, Orsay, France.

Literature Cited

1. Glasstone, S., Laidler, K.J., and Eyring, H., "Theory of Rate Processes", McGraw-Hill, New York, 1941
2. Alder, B.J., and Wainwright, T.E., J. Chem. Phys. (1959) 31, 459
3. Verlet, L., Phys. Rev. (1967) 159, 98
4. Rahman, A. and Stillinger, F.H., J. Chem. Phys. (1971) 55, 3336
5. Bunker, D.L., Methods Comp. Phys. (1971) 10, 287
6. Wood, W.W., and Erpenbeck, J.J., Ann. Rev. Phys. Chem. (1976) 27
7. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E., J. Chem. Phys. (1953) 21, 1087
8. Wood, W.W., in "Physics of Simple Liquids" (ed. H.N.V. Temperley, J.S. Rowlinson, and G.S. Rushbrooke), pp. 115-230, North-Holland, Amsterdam, 1968
9. Wood, W.W., in "Fundamental Problems in Statistical Mechanics III" (ed. E.G.D. Cohen), pp. 331-338, North-Holland, Amsterdam, 1974
10. Keck, J.C., Discuss. Faraday Soc. (1962) 33, 173
11. Anderson, J.B., J. Chem. Phys. (1973) 58, 4684
12. Wynblatt, P., J. Phys. Chem. Solids (1968) 29, 215
13. Jaffe, R.L., Henry, J.M., and Anderson, J.B., J. Chem. Phys. (1973) 59, 1128
14. Bennett, C.H., in "Diffusion in Solids: Recent Developments" (J.J. Burton and A.S. Nowick, ed.), pp. 73-113, Academic Press, New York, 1975
15. Bennett, C.H., 19e Colloque de Métallurgie, Commissariat à l'Energie Atomique, Saclay, France, 22-25 June 1976
16. Rahman, A., unpublished results
17. Bennett, C.H., J. Comp. Phys. (1976) 22, 245
18. Valleau, J.P., and Card, D.N., J. Chem. Phys. (1972) 57, 5457
19. Torrie, G. and Valleau, J.P., J. Comp. Phys., to be published
20. Pechukas, P. and McLafferty, F.J., J. Chem. Phys. (1973) 58, 1622
21. Chapman, S., Hornstein, S.M., and Miller, W.H., J. Am. Chem. Soc. (1975) 97, 892
22. Fletcher, R., Computer J. (1970) 13, 317
23. Fletcher, R. and Reeves, C.M., Computer J. (1964) 7, 149
24. McIver, J.W. and Komornicki, A., J. Amer. Chem. Soc. (1972) 94, 2625
25. Wigner, E., Z. Physik. Chem. (1932) B19, 203
26. Johnston, H.S., "Gas Phase Reaction Rate Theory", pp. 133-134, Ronald Press, New York, 1966
27. ibid., pp. 190 ff.
28. Chapman, S., Garrett, B.C., and Miller, W.H., J. Chem. Phys. (1975) 63, 2710
29. Miller, W.H., J. Chem. Phys. (1974) 61, 1823
30. Johnston, op. cit., pp. 310 ff.
31. Levitt, M. and Warshel, A., Nature (1975) 253, 694
32. Levitt, M., J. Mol. Biol. (1976) 104, 59
33. Gibson, K.D. and Scheraga, H.A., Proc. Nat. Acad. Sci. USA (1969) 63, 9
34. Levitt, M., private communication
35. Bennett, C.H., Report of 1976 Workshop on Protein Dynamics, Centre Européen de Calcul Atomique et Moléculaire, Orsay 91405, France
36. Dewar, M.J.S. and Kirschner, S., J. Amer. Chem. Soc. (1971) 93, 4291
37. Bennett, C.H., J. Comp. Phys. (1975) 19, 267
38. Ryckaert, J.P., Ciccotti, G., and Berendsen, H.J.C., J. Comp. Phys., to be published
39. Torrie, G., Valleau, J.P., and Bain, A., J. Chem. Phys. (1973) 58, 5479
5
Newer Computing Techniques for Molecular Structure Studies by X-Ray Crystallography
DAVID J. DUCHAMP The Upjohn Co., Kalamazoo, MI 49001
Crystallographers have been users of computers ever since computers became available for scientific calculations. The nature of crystallographic calculations used in molecular structure determination (large amounts of data to be treated by rather complicated mathematics) makes efficient use of computers essential and led quite early to the development of rather sophisticated techniques for both manual and computer computations. The features which make crystallographic calculations somewhat different include: 1) the use of symmetry, i.e. space groups, 2) the use of a generalized coordinate system, 3) the three dimensional nature of both data and intermediate and final results, 4) the high precision of the results, leading to generous use of statistics, 5) use of computer controlled data acquisition, and 6) the need for display and presentation of three-dimensional molecular structure information. For the most part, these are the areas in which crystallographers have tended to be in the forefront in algorithm development.

This paper concentrates on newer computing techniques, trying to give a sampling of recently developed techniques, which may be useful to both crystallographers and non-crystallographers. Material judged understandable only with in-depth crystallographic background has been omitted. Apologies are made for the omission of many "favorite" algorithms. Since many of the algorithms are unpublished, the more detailed descriptions are taken of necessity from the author's own experience. The older algorithms not discussed here are well described in standard reference works, such as "The International Tables for X-ray Crystallography" (1) and textbooks by Rollett (2) and Stout and Jensen (3). In addition, many of the algorithms used in crystallographic computing are taken from numerical analysis (4) or are direct applications of standard computing algorithms such as those used in sorting data. The recent textbook of Aho, Hopcroft and Ullman (5) (and the references therein) provides an excellent introduction to the literature of general purpose computing algorithms, as well as an introduction to the strategies used in development of efficient algorithms.

Computing Techniques for X-ray Diffractometers

In most computer-controlled diffractometer systems, the computer has control of the settings and rate of change of the angles (usually 4) which determine the orientation of the crystal and the position of the radiation detector relative to the incident X-ray beam. It can also usually open and close the incident beam shutter, and control the counting of pulses from the detector. The basic process of data collection, which all systems can perform, consists of: for each reflection 1) calculate the settings of the angles, 2) move the diffractometer goniometer to those settings, 3) measure the intensity of the reflection, and 4) output the measured intensity. In addition most systems have enhancements, such as a program to aid in determining the orientation of the crystal on the instrument. Usually a fair amount of manual operation is required in setting up the experiment, including the correct indexing of the reflections.

In most cases, the crystallographer has little control over the computer programs, since they are most often coded in assembler language on a small minicomputer, and are therefore difficult to modify. In some laboratories, however, most of the programs are written in an easily changed high level language, making it easy to modify the algorithms used for programmed experiments, and to develop programs for new experiments.
In the system in our laboratory (Figure 1), a small instrument control minicomputer operates as a slave to a larger lab automation computer. When a Fortran program running in the larger computer wants a specific task performed on the diffractometer, it loads a program into the minicomputer (unless the program is already there), and sends it information for the task to be performed. When the task is complete, the Fortran programs in the larger computer process the result and determine the course of the experiment. Getting a piece of information measured on the diffractometer is functionally similar to calling a subroutine which returns after the information is available. An alternative way to achieve the same flexibility is to build up the instrument control minicomputer into a much larger system.

[Figure 1. UPACS computer-controlled diffractometer system. Block diagram: the UPACS multi-instrument lab automation computer (with disk) sends commands, programs, and terminal output to the instrument control minicomputer (with terminal) and receives data and terminal input; the minicomputer handles angle control, angle position, counter, and shutter on the diffractometer.]

Several improvements to the basic data collection algorithm have been made. Perhaps the most significant is the use of the step-scan technique, versions of which were developed in 1969 for our computerized diffractometer, and simultaneously elsewhere. The usual method of integrated intensity measurement is to scan continually through the reflection profile, accumulating counts continuously, then to measure the background by counting for fixed time at each extreme of the profile (6). Blessing, Coppens, and Becker have recently discussed the step-scan procedure (7). Basically it consists of sampling the peak profile at a number of points, perhaps 50 to 100 (see Figure 2). Computer analysis of the recorded profile provides many advantages over the "blind" continuous scan mode, allowing a much superior background correction, making possible the detection of abnormal profiles, and producing a reduction in experimental standard deviations over the former method. In addition the step-scan experiment is generally faster since the time spent counting background is eliminated. Further work on processing step-scan data (8, 9) and further work optimizing the measurement of x-ray intensities (10, 11, 12) have recently appeared; the references in those papers provide access to the earlier literature on this subject.

[Figure 2. Step scan data collection. Profile regions: background, peak, background.]

In addition to the improvement of the basic data collection procedures, programs and algorithms are being developed for other experiments to assist in the use of the diffractometer and to make the process more automatic. Progress in this area has been slow, as recently pointed out by Spinrad (13). The goal of being able to drop a crystal in a magic funnel and have everything happen automatically is not in sight; however, significant automatic enhancements are being made. Procedures to aid in indexing reflections were developed by Sparks (14) and more recently by Jacobson (15); in our laboratory a procedure involving somewhat more interaction with the diffractometer is under development. Two experiments which we have found very useful, precise alignment of the x-ray tube and determination of precision unit cell parameters, are described in detail below.
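The idea of taking the background from the recorded profile itself can be sketched in a few lines. This is a simplified illustration, not the procedure of the cited papers: it assumes that a fixed number of steps at each end of the scan (the hypothetical parameter n_background) lies outside the peak, and it uses plain counting statistics for the standard deviation.

```python
import math

def integrate_step_scan(counts, n_background=5):
    """Net integrated intensity and its standard deviation from a step scan.

    counts       -- counts recorded at each step across the reflection
    n_background -- steps at each end treated as background (an assumption)
    """
    n = len(counts)
    if n <= 2 * n_background:
        raise ValueError("profile too short for the chosen background width")
    bkg_points = list(counts[:n_background]) + list(counts[-n_background:])
    peak_points = list(counts[n_background:n - n_background])
    bkg_per_step = sum(bkg_points) / len(bkg_points)
    n_peak = len(peak_points)
    net = sum(peak_points) - n_peak * bkg_per_step
    # Counting statistics: the variance of a sum of counts is the sum itself,
    # and the mean background per step has variance sum(bkg) / len(bkg)**2.
    variance = sum(peak_points) + n_peak ** 2 * sum(bkg_points) / len(bkg_points) ** 2
    return net, math.sqrt(variance)
```

Because the whole profile is stored, the same data could also be inspected for asymmetry or for a sloping background before the intensity is accepted.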
When the x-ray tube is changed on a diffractometer it must be positioned very precisely to center the x-ray beam in the incident beam collimator. This is accomplished by translating the tube in the plane perpendicular to the collimator. Approximate positioning is easily accomplished manually. Then a test crystal is placed on the diffractometer, and from angle values obtained by centering certain reflections in the detector, misalignment of the tube may be inferred. The process is complicated by slight deviations of the crystal from the center of the goniometer (both in height along the φ axis and translation normal to it), the arbitrary zero point of the φ angle, and possible misalignments of the zero points of the 2θ, ω, and χ angles, all of which affect the centering of a reflection in the detector. In our procedure, the user mounts the test crystal, invokes the procedure and gives the computer approximate setting angles for one or more reflections. The computer measures accurate centering angles for each test reflection at the 8 possible positions with ω = θ, as shown in Table I(a). From this data, a simple algorithm allows the computer to separate the different variables, and to direct the user exactly (to within the approximation of small translations) how far and in what direction to move the tube; see Table I(b). Other valuable information derived from this experiment are accurate determinations of the true zeros of the ω, 2θ, and χ angles. The detailed equations are not presented here, since they vary with goniometer geometry; however, a short Fortran program for performing the calculation for the Syntex diffractometer is available from the author on request.

Table I

a) Settings with ω = θ

    2θ       ω        φ          χ
    2θ       2θ/2     φ          χ
   -2θ      -2θ/2     φ          χ
   -2θ      -2θ/2     φ          180 + χ
    2θ       2θ/2     φ          180 + χ
    2θ       2θ/2     φ + 180    -χ
   -2θ      -2θ/2     φ + 180    -χ
   -2θ      -2θ/2     φ + 180    180 - χ
    2θ       2θ/2     φ + 180    180 - χ

b) Computer report (retyped for clarity)

    X-RAY ALIGNMENT REPORT   AFTER-ADJUST-AGAIN   3/4/75   12812

         L    TTH      OMEGA    PHI       CHI          INT
    0    0    16.46     8.21    332.04    78.98        815
    0    0   -16.45    -8.23    332.04    79.20        830
    0    0   -16.44    -8.21    332.04    180+79.20    811
    0    0    16.46     8.23    332.04    180+78.96    783
    0    0    16.45     8.22    152.03    -79.16       917
    0    0   -16.46    -8.22    152.03    -79.07       903
    0    0   -16.46    -8.23    152.03    180-78.95    896
    0    0    16.45     8.22    152.03    180-79.28    913

    PHI ERROR = -0.022     PHI (CORRECTED) = 332.062
    CHI (AVE) = 79.105     AVE DEL (CHI) = 0.110
    NEED TO MOVE TUBE DOWN 3.2 DIVISIONS
    CHI (ZERO) = -0.015
    OMEGA ERROR FROM CENTERING = -0.000   PROBABLY CRYSTAL HEIGHT
    APPARENT TTH (ZERO) = 0.001     APPARENT OMEGA (ZERO) = -0.000
    NEED TO MOVE TUBE OUT 0.3 DIVISIONS FOR TTH
    OR MOVE TUBE IN 0.1 DIVISIONS FOR OMEGA

Although a determination of the unit cell parameters results from determination of the orientation and indices of several reflections used to initiate the data collection experiment, we have found that a considerably more accurate determination may be made by running a separate experiment involving only measuring 2θ values for high 2θ reflections. Depending upon the crystal system, 1, 2, 4, or 6 of the unit-cell axial lengths and interaxial angles have to be measured experimentally, the remaining parameters being fixed by symmetry. The symmetry of the unit cell is important and must be used in precision unit-cell determination. The procedure consists of four steps: 1) the computer surveying the intensities of previously measured reflections to choose about 20 high 2θ reflections, 2) making highly accurate step-scans of the selected reflections, 3) calculating accurate 2-theta values from the scan data, and 4) calculating unit-cell parameters from accurate 2-theta measurements.

The method used to calculate the "best" 2-theta for each reflection from step-scan data was developed especially for this system. Each peak is actually a doublet: one peak due to Kα1 radiation and another due to Kα2 radiation. The method assumes that this doublet may be fit by the sum of two Gaussian curves separated by Δ2θ, which can be calculated from the wavelengths and the approximate 2-theta of the α1 peak:
$$ I_i^{c} = c\left\{ 2\exp\left[-\frac{(2\theta_i - 2\theta_1)^2}{w^2}\right] + \exp\left[-\frac{\bigl(2\theta_i - (2\theta_1 + \Delta 2\theta)\bigr)^2}{w^2}\right] \right\} + d \tag{1} $$

where I_i^c is the calculated count at 2θ_i; w, c, and d are parameters dependent upon peak width, peak height, and background, respectively. The "best" 2-theta, 2θ_1 above, is calculated by a non-linear least-squares procedure which varies c, w, and 2θ_1 to minimize

$$ \sum_{\text{all steps}} \left[ g_i \left( I_i^{o} - I_i^{c} \right) \right]^2 \tag{2} $$

where g_i is the weight calculated by taking the reciprocal of the standard deviation (from counting statistics) of the observed count I_i^o.
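A refinement of this kind can be sketched as a damped Gauss-Newton iteration with analytic derivatives. The sketch below is an illustration under stated assumptions and not the author's program: the 2:1 height ratio and the width convention follow equation 1 as written here, the background d is simply the mean of a few end points, the damping scheme is a generic Levenberg-Marquardt style safeguard, and all names and default values are hypothetical.

```python
import numpy as np

def doublet(x, c, w, x1, delta, d):
    """Two Gaussians of height ratio 2:1 separated by delta, plus background d."""
    e1 = np.exp(-((x - x1) / w) ** 2)
    e2 = np.exp(-((x - x1 - delta) / w) ** 2)
    return c * (2.0 * e1 + e2) + d, e1, e2

def fit_two_theta(x, counts, delta, c, w, x1, n_ends=5, max_iter=10):
    """Refine c, w and x1 (the alpha-1 peak position) from step-scan counts."""
    x = np.asarray(x, float)
    y = np.asarray(counts, float)
    d = np.mean(np.r_[y[:n_ends], y[-n_ends:]])   # background, held fixed
    g = 1.0 / np.sqrt(np.maximum(y, 1.0))         # 1/sigma from counting statistics
    p = np.array([c, w, x1], float)

    def chi2(q):
        model, _, _ = doublet(x, q[0], q[1], q[2], delta, d)
        return np.sum((g * (y - model)) ** 2)

    damping, best = 1e-3, chi2(p)
    for _ in range(max_iter):
        c_, w_, x1_ = p
        model, e1, e2 = doublet(x, c_, w_, x1_, delta, d)
        u1, u2 = x - x1_, x - x1_ - delta
        # Analytic derivatives of the model with respect to c, w and x1.
        jac = np.column_stack([
            2.0 * e1 + e2,
            c_ * (4.0 * e1 * u1 ** 2 + 2.0 * e2 * u2 ** 2) / w_ ** 3,
            c_ * (4.0 * e1 * u1 + 2.0 * e2 * u2) / w_ ** 2,
        ]) * g[:, None]
        normal = jac.T @ jac
        rhs = jac.T @ (g * (y - model))
        step = np.linalg.solve(normal + damping * np.diag(np.diag(normal)), rhs)
        trial = p + step
        trial_chi2 = chi2(trial) if trial[1] > 0 else np.inf
        if trial_chi2 < best:          # accept the shift and relax the damping
            p, best, damping = trial, trial_chi2, damping / 10.0
        else:                          # reject the shift and damp more heavily
            damping *= 10.0
    return p, d
```

The damping term limits the size of early shifts, which is one simple way to keep strongly correlated parameters such as c and w from destabilizing the refinement.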
The value of d is calculated by averaging step-scan observations at the ends of the scan, and is not varied during the least-squares procedure. Derivatives are calculated analytically using expressions obtained by differentiating equation 1. Up to 10 iterations are allowed; 3 to 5 are usually required.

When the method was developed, the effects of c, w, and d on 2θ_1 and σ(2θ_1), the error estimate for 2θ_1, were thoroughly studied. The value used for d was found to have little or no effect on either 2θ_1 or σ(2θ_1), unless an utterly ridiculous d value was assumed. Therefore d is not included in the refinement. The values of c and w were found to have only small effects on 2θ_1 but somewhat larger effects on σ(2θ_1). The two parameters c and w are strongly correlated; allowing large shifts in c before w has quieted down results in an unstable refinement.

Calculation of the unit-cell parameters from the 2θ data is accomplished by a special adaptation of a method used in several laboratories for determining accurate cell parameters from special film data (16). For the general case
$$ \sin^2\theta = h^2 a^{*2} + k^2 b^{*2} + l^2 c^{*2} + 2kl\, b^* c^* \cos\alpha^* + 2hl\, a^* c^* \cos\beta^* + 2hk\, a^* b^* \cos\gamma^* \tag{3} $$

where h, k, and l are reflection indices; a*, b*, c*, α*, β*, γ* are axial lengths and interaxial angles of the reciprocal cell. Equation 3 may be abbreviated as

$$ \sin^2\theta = h^2 s_1 + k^2 s_2 + l^2 s_3 + kl\, s_4 + hl\, s_5 + hk\, s_6 \tag{4} $$

The linear least-squares procedure determines s_1, ..., s_6 so as to minimize

$$ \sum_{i} w_i \left( \sin^2\theta_o - \sin^2\theta_c \right)_i^2 \tag{5} $$
Comparison of equations 3 and 4 shows immediately how to calculate the reciprocal cell parameters from the coefficients in 4. From these, the unit cell parameters may be calculated using standard expressions (17). The weight of each observation is calculated by

$$ w = \frac{1}{(\sin 2\theta)\,\sigma(2\theta)} \tag{6} $$

The effect of symmetry is conveniently taken into account by restrictions on s_1, ..., s_6 as follows:

    Crystal System    To Be Determined      Restrictions
    Triclinic         s1, ..., s6           None
    Monoclinic        s1, s2, s3, s5        s4 = s6 = 0
    Orthorhombic      s1, s2, s3            s4 = s5 = s6 = 0
    Tetragonal        s1, s3                s2 = s1, s4 = s5 = s6 = 0
    Hexagonal         s1, s3                s2 = s6 = s1, s4 = s5 = 0
    Cubic             s1                    s2 = s3 = s1, s4 = s5 = s6 = 0
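The linear fit of equation 4 under these symmetry restrictions can be sketched as follows. This is an illustrative reconstruction, not the author's program: the function names, the dictionary of column groupings, and the convention that the supplied weights multiply the squared residuals as in expression 5 are all assumptions of the sketch.

```python
import numpy as np

# For each crystal system: one entry per free coefficient, listing which of the
# six general columns [h^2, k^2, l^2, kl, hl, hk] that coefficient multiplies.
FREE_COLUMNS = {
    "triclinic":    [[0], [1], [2], [3], [4], [5]],
    "monoclinic":   [[0], [1], [2], [4]],      # s4 = s6 = 0
    "orthorhombic": [[0], [1], [2]],           # s4 = s5 = s6 = 0
    "tetragonal":   [[0, 1], [2]],             # s2 = s1
    "hexagonal":    [[0, 1, 5], [2]],          # s2 = s6 = s1
    "cubic":        [[0, 1, 2]],               # s2 = s3 = s1
}

def fit_cell_coefficients(hkl, sin2_theta, weights, system):
    """Weighted linear least squares for s1..s6 of equation 4."""
    hkl = np.asarray(hkl, float)
    h, k, l = hkl[:, 0], hkl[:, 1], hkl[:, 2]
    general = np.column_stack([h * h, k * k, l * l, k * l, h * l, h * k])
    groups = FREE_COLUMNS[system]
    # Columns tied together by symmetry are summed into one design column.
    design = np.column_stack([general[:, g].sum(axis=1) for g in groups])
    root_w = np.sqrt(np.asarray(weights, float))
    free, *_ = np.linalg.lstsq(design * root_w[:, None],
                               np.asarray(sin2_theta, float) * root_w,
                               rcond=None)
    s = np.zeros(6)
    for value, g in zip(free, groups):
        s[g] = value                            # restricted terms stay at zero
    return s

def reciprocal_cell(s):
    """a*, b*, c* and the cosines of alpha*, beta*, gamma* from s1..s6."""
    a, b, c = np.sqrt(s[:3])
    return a, b, c, s[3] / (2 * b * c), s[4] / (2 * a * c), s[5] / (2 * a * b)
```

For a hexagonal cell the restriction s6 = s1 gives cos γ* = 0.5 from the second function, a quick consistency check on the table above.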
Standard deviations in unit-cell parameters may be calculated analytically by error propagation. In these programs, however, the Jacobian of the transformation from s_1, ..., s_6 to unit-cell parameters and volume is evaluated numerically and used to transform the variance-covariance matrix of s_1, ..., s_6 into the variances of the cell parameters and volume, from which standard deviations are calculated. If suitable standard deviations are not obtained for certain of the unit cell parameters, it is easy to program the computer to measure additional reflections which strongly correlate with the desired parameters, and repeat the final calculations with this additional data.
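Numerical propagation of this kind can be sketched generically. The central-difference step size and the function names below are assumptions of the sketch; the transformation itself is passed in as an ordinary function, so the same code serves for any mapping from the fitted coefficients to derived quantities.

```python
import numpy as np

def numerical_jacobian(func, x, rel_step=1e-6):
    """Central-difference Jacobian of a vector-valued func at the point x."""
    x = np.asarray(x, float)
    n_out = np.asarray(func(x), float).size
    jac = np.empty((n_out, x.size))
    for j in range(x.size):
        step = rel_step * max(abs(x[j]), 1.0)
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[j] += step
        x_minus[j] -= step
        jac[:, j] = (np.asarray(func(x_plus), float)
                     - np.asarray(func(x_minus), float)) / (2.0 * step)
    return jac

def propagate_covariance(func, x, cov_x):
    """Covariance of func(x) to first order: J cov_x J^T."""
    jac = numerical_jacobian(func, x)
    return jac @ np.asarray(cov_x, float) @ jac.T
```

The square roots of the diagonal of the returned matrix are the standard deviations of the derived quantities; off-diagonal terms of the input covariance are carried through automatically.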
Treatment o f C r y s t a l D e t e r i o r a t i o n :
The v a r i a t i o n o f the i n t e g r a t e d i n t e n s i t i e s o f X-ray r e f l e c t i o n as a f u n c t i o n o f time o f exposure t o X-rays i s a problem which has plagued c r y s t a l ! o g r a p h e r s f o r some time. L i t t l e i s known o f the p h y s i c a l and chemical processes l e a d i n g to r a d i a t i o n damage (18). U s u a l l y several c a r e f u l l y chosen r e f l e c t i o n s (check r e f l e c t i o n s ) are repeated a t r e g u l a r i n t e r v a l s during data c o l l e c t i o n . The problem i s how best t o use the f l u c t u a t i o n s i n these measured i n t e n s i t i e s t o s c a l e the observed s e t o f i n t e n s i t i e s . We use 10 check r e f l e c t i o n s a f t e r experimenting w i t h more and fewer. Since the f l u c t u a t i o n s o f i n t e n s i t y w i t h time are almost always n o n - l i n e a r , and f r e q u e n t l y a r e non-monotonic a l s o , a f a i r l y complicated f u n c t i o n i s r e q u i r e d t o express the deterioration scale factor. In the procedure described here, the s c a l e f a c t o r i s represented as a f u n c t i o n o f time C ( t ) described mathematically by C(t) = a i f i ( t ) + a f ( t ) + ... + a f ( t )
2 2 p p
(7)
where t i s the cumulative exposure time o f the c r y s t a l , the f ^ U ) are f u n c t i o n s o f t , and the a^ are the c o e f f i c i e n t s t o be d e t e r The mined from the check r e f l e c t i o n data t o s p e c i f y C ( t ) .
c r i t e r i a chosen i s t o determine the a^ so as t o minimize the sum of the weighted second moments about the means o f the s c a l e d check r e f l e c t i o n i n t e n s i t i e s . With a second Lagrange undetermined m u l t i p l i e r term added t o avoid the t r i v i a l minimum, the f u n c t i o n minimized becomes
where $t_{ij}$ is the time for the $i$th observation of the $j$th check reflection and $g_{ij}$ is its intensity. The weights $w_j$ are defined by
$$w_j = \left[\,\cdots\,\right]^{-1} \tag{8}$$
where $\sigma_{ij}$ is the standard deviation in $g_{ij}$, and $m_j$ is the number of observations of check reflection $j$. The $b_j$ are defined by
$$b_j = m_j^{-1} \sum_i C(t_{ij})\, g_{ij} \tag{9}$$
By suitable mathematical manipulation the above may be shown to be a linear least-squares problem with constraint in the variables $a_k$. Before the $a_k$ can be determined, the functions $f_k(t)$ must be specified. If $C(t)$ is chosen to be a simple polynomial in $t$ (i.e., $f_k(t) = t^{k-1}$), and a direct least-squares solution is calculated, numerical trouble usually results since the determinant of the coefficients of the normal equations tends to be very small (19). A $C(t)$ with all the flexibility of the general polynomial is obtained, and the numerical problem is avoided, by choosing the $f_k(t)$ to be the orthogonal polynomials of Forsythe (19). Cast in our notation, the $f_k(t)$ are defined recursively by
$$f_1(t) = 1$$
$$f_2(t) = (t - u_2)\, f_1(t)$$
$$f_3(t) = (t - u_3)\, f_2(t) - v_2\, f_1(t)$$
$$f_k(t) = (t - u_k)\, f_{k-1}(t) - v_{k-1}\, f_{k-2}(t) \tag{11}$$
where
$$u_k = \frac{\cdots}{d_{k-1}} \tag{12}$$
$$v_{k-1} = \frac{d_{k-1}}{d_{k-2}} \tag{13}$$
$$d_k = \sum \cdots \left(f_k(t_{ij})\right)^2 \tag{14}$$
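A three-term recurrence of this kind can be sketched as follows. The sketch uses the textbook form of Forsythe's construction for a plain weighted least-squares fit: the weights `w`, the sums used for the recurrence coefficients, and the function names are assumptions made for illustration, and the constrained problem and weighting actually used in the chapter are not reproduced.

```python
def forsythe_basis(t, w, p):
    """Return (values, norms): values[k][i] is the (k+1)th polynomial at t[i],
    norms[k] is the weighted sum of its squares over the data points."""
    n = len(t)
    values = [[1.0] * n]                 # first polynomial is the constant 1
    norms = [float(sum(w))]
    for k in range(1, p):
        prev = values[k - 1]
        # Shift coefficient: weighted mean of t under the previous polynomial.
        u = sum(w[i] * t[i] * prev[i] ** 2 for i in range(n)) / norms[k - 1]
        if k == 1:
            new = [(t[i] - u) * prev[i] for i in range(n)]
        else:
            v = norms[k - 1] / norms[k - 2]   # ratio of successive norms
            prev2 = values[k - 2]
            new = [(t[i] - u) * prev[i] - v * prev2[i] for i in range(n)]
        values.append(new)
        norms.append(sum(w[i] * new[i] ** 2 for i in range(n)))
    return values, norms

def fit_coefficients(t, w, y, p):
    """Least-squares coefficients; orthogonality makes each one independent."""
    values, norms = forsythe_basis(t, w, p)
    return [sum(w[i] * y[i] * f[i] for i in range(len(t))) / d
            for f, d in zip(values, norms)]

def evaluate(coeffs, values):
    """Fitted curve at the data points."""
    return [sum(a * f[i] for a, f in zip(coeffs, values))
            for i in range(len(values[0]))]
```

Because the basis functions are orthogonal over the data points, each coefficient is a single ratio of sums and no set of normal equations has to be solved, which is the numerical advantage the text refers to.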
In this formulation, the needed coefficients $a_k$ may be calculated directly without recourse to solving the usual eigenvector problem. In our programs provision is also made for a dependence of the scale factor on direction in the crystal, $\mathbf{h}$, and on the Bragg angle, $\theta$. A new scale factor $C'(t, \mathbf{h}, \theta)$ is defined as
$$C'(t, \mathbf{h}, \theta) = 1 + \left(C(t) - 1\right) H(\mathbf{h})\, E(\theta) \tag{15}$$
where $C(t)$ is our original function in time, $H(\mathbf{h})$ is a direction dependent factor with six determinable parameters, and $E(\theta)$ is a factor with one determinable parameter. The coefficients of this generalized scale factor function are determined to minimize the same quantity with $C'$ replacing $C$, by first solving as before with the new parameters set so that $H(\mathbf{h}) = E(\theta) = 1.0$, then allowing all parameters to vary from that point in an iterative minimization procedure similar to "steepest descents". A more detailed description of the generalized scale factor function is contained in an implementation of this scaling algorithm in a Fortran data reduction program available from the author.
Hidden Line Algorithms:
In the display of a three dimensional object on a plotter or on the screen of a graphics terminal, the task of deciding which parts of the object should be shown and which should be eliminated (or made dashed) is known as the "hidden line problem". This problem and the more complicated "hidden surface problem" have recently been reviewed by Sutherland, Sproull and Schumacker (20) from a sorting point of view. These algorithms are especially important because programs with inefficient hidden line algorithms can use up enormous amounts of computer time and because manual "touch up" of drawings to eliminate hidden line errors may be quite time consuming. The most efficient algorithms result when the object to be drawn has special features which allow the general problem to be simplified.
Two problems are treated here in some detail: the drawing of a crystal from face measurements and the drawing of a "ball and stick" representation of a molecule. The problem of producing a drawing of a crystal like that shown in Figure 3 arose in a graphics program (21) used to visually compare the computer description of a crystal as a convex polyhedron with the crystal as viewed on an optical goniometer. The problem is one of displaying a convex polyhedron given the information describing the faces of the polyhedron. From this information the faces which intersect at the various corners and the coordinates of the corners can easily be computed (22). From this, a list of edges--the lines actually to be drawn in the figure--can easily be compiled.
In producing the drawing, a rotation of the coordinates of the corners is performed to give a set of x,y,z relative to an origin at the center with the x axis aligned with the viewing direction. Next is identification of those edges which lie on the convex polygon which defines the periphery of the polyhedron in projection on the y,z plane. For each edge, defined by two corners i and j, the edge is on the polygon if all other corners either lie on the edge or on one side of it in projection on the y,z plane, or simply if
$$(z_i - z_j)\, y_k + (y_j - y_i)\, z_k + (y_i z_j - y_j z_i) \;\ge\; 0 \ \text{for all } k \quad \text{or} \quad \le\; 0 \ \text{for all } k \tag{16}$$
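The periphery test just described amounts to checking that every other corner falls on one side of the projected edge. A minimal sketch follows, assuming corners are stored as (x, y, z) tuples with x along the viewing direction; the function names and the data layout are illustrative and are not taken from the author's program.

```python
def side(yi, zi, yj, zj, yk, zk):
    """Signed quantity that is zero when corner k lies on the projected line
    through corners i and j, and changes sign across that line."""
    return (zi - zj) * yk + (yj - yi) * zk + (yi * zj - yj * zi)

def peripheral_edges(corners, edges):
    """Return the edges (i, j) lying on the outline of the projection.

    corners: list of (x, y, z) tuples, x along the viewing direction
    edges:   list of (i, j) index pairs into corners
    """
    outline = []
    for i, j in edges:
        _, yi, zi = corners[i]
        _, yj, zj = corners[j]
        values = [side(yi, zi, yj, zj, corners[k][1], corners[k][2])
                  for k in range(len(corners)) if k not in (i, j)]
        # On the outline if no two corners fall on opposite sides of the edge.
        if all(v >= 0 for v in values) or all(v <= 0 for v in values):
            outline.append((i, j))
    return outline

# A square pyramid viewed down its axis: the four base edges form the outline,
# while the four edges to the apex project into the interior.
pyramid = [(0, 1, 1), (0, 1, -1), (0, -1, -1), (0, -1, 1), (1, 0.2, 0.1)]
pyramid_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4), (1, 4), (2, 4), (3, 4)]
outline = peripheral_edges(pyramid, pyramid_edges)
```

The cost is one pass over all corners per edge, which is negligible for the small number of faces on a typical crystal.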
For simplicity in practice, the = 0 case is eliminated by slight translation of corner coordinates. All other edges are either "totally hidden" or "totally visible". The "hidden line" problem, therefore, becomes one of classifying the edges (the lines actually drawn) into one of the three categories. Also, a "totally hidden" edge may not connect with a "totally visible" edge except through one of the corners on the peripheral polygon. Because of the convex property of the polyhedron, other edges may be classified by connectivity if one edge not on the polygon is classified. This is accomplished easily by finding two edges defined by corners i,k and i,j where corners i and j are on the polygon and k is not. The edge defined by corners i,k is either "totally visible" or "totally hidden" according as a and d defined below have the same or opposite signs, respectively.
[Equations 17 and 18, which define a and d, are not legible in the source.]
As many unclassified edges are classified by connectivity as possible. Then if unclassified edges remain, equations 17 and 18 are used to classify another, etc., until all edges are classified.
In the DRAW program which we developed, the "hidden line" algorithm for ball and stick drawings of molecules (such as Figure 4) likewise makes use of special features of the object. The drawing is composed of only two kinds of figures--circular atoms and trapezoidal bonds. Our algorithm is similar to one developed by Okaya (23). The more complicated case of general ellipsoidal representation of atoms has been treated by Johnson in the latest version of his heavily used ORTEP program (24). In principle, when each atom or bond is drawn, it must be tested against all other bonds and atoms to see if it is hidden, totally or in part. In the drawing operation, each atom or bond is represented by a number (usually 100 to 200) of points with
Figure 3. Computer drawing of crystal from face description
Figure 4. Ball and stick drawing of molecule of p-bromophenacyl ester of tirandamycic acid
straight lines connecting them; a separate visibility test must be made on each point in deciding whether to draw the lines to and from it. As the number of atoms (n) grows, the complexity of the calculation increases as $n^2$. By using an application of the "divide and conquer" strategy (25), the problem is reduced to a very quick approximately $n^2$ complexity part and a more time consuming almost n complexity part. At the time each atom or bond figure is drawn, a quick test is employed to compile a list of those atoms or bonds which could possibly overlap in the figure. In practice the size of this list, after reaching a certain level, does not increase very much as n increases. This is easily understood by considering that: 1) for m randomly distributed objects within a volume, the "object thickness" varies as the cube root of m, and 2) in order to make a drawing understandable, people usually draw figures with minimum overlap in the projection direction. Therefore the time consuming point by point tests are performed only on a greatly reduced number of figures. A number of enhancements can be made to the point-by-point test which speed it up but do not reduce its complexity. On the other hand, in principle, the complexity of the pretest portion can be reduced from $n^2$ to $n^{3/2}$ by ordering the bonds and atoms in the longest direction in the plane of projection, and only testing figures lying in a relevant band. Since the pretest is so fast, we have not implemented this final refinement in the batch versions of our program; however, it is under consideration for a graphics version now being implemented.
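The quick pretest and its banded refinement might look like the sketch below. It assumes, purely for illustration, that each atom or bond is summarized by a bounding circle (y, z, radius) in the plane of projection and that y is the longest direction; neither the data layout nor the function names come from the DRAW program.

```python
def overlaps(a, b):
    """Bounding-circle test for two projected figures (y, z, radius)."""
    dy, dz, reach = a[0] - b[0], a[1] - b[1], a[2] + b[2]
    return dy * dy + dz * dz < reach * reach

def candidates_all_pairs(figs):
    """Pretest in its simple form: every figure against every other."""
    return {(i, j)
            for i in range(len(figs)) for j in range(i + 1, len(figs))
            if overlaps(figs[i], figs[j])}

def candidates_banded(figs):
    """Same candidate pairs, testing only figures in a relevant band of y."""
    order = sorted(range(len(figs)), key=lambda i: figs[i][0] - figs[i][2])
    found = set()
    for p, a in enumerate(order):
        right_edge = figs[a][0] + figs[a][2]
        for b in order[p + 1:]:
            # Figures are sorted by left extent, so once one starts beyond
            # the right edge of figure a, none of the remaining ones can overlap it.
            if figs[b][0] - figs[b][2] >= right_edge:
                break
            if overlaps(figs[a], figs[b]):
                found.add((min(a, b), max(a, b)))
    return found

figures = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (5.0, 0.0, 1.0),
           (5.5, 3.0, 1.0), (0.5, 0.5, 0.2)]
```

Only the pairs returned by the pretest would then be passed to the expensive point-by-point visibility test.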
Use of the Fast Fourier Transform:
Although the principle of the fast Fourier transform (FFT) algorithm has been widely understood for over ten years (26, 27), the FFT is only now beginning to be used widely for crystallographic calculations. The reasons for this are: 1) the advantages of the FFT are not nearly as great in crystallographic computing as in other fields, 2) crystallographic trigonometric Fourier algorithms (28) have been highly developed and are very efficient, and 3) incorporation of the special features of crystallographic calculations, such as symmetry, has required additional algorithm development. In the simpler FFT applications to chemistry, such as in Fourier transform spectroscopy, the tremendous advantage of the FFT algorithm arises because, for computing n Fourier coefficients from n data points, the FFT algorithm reduces the complexity from $n^2$ to $n \log n$. This is brought about by factoring the transform very finely so as to allow calculations common to several transformed points to be performed only once. In the efficient formulations of the crystallographic trigonometric Fourier algorithm, a certain amount of factoring is employed, leading to a complexity of approximately $n^{4/3}$ (29) instead of the usually quoted $n^2$. In an early comparison (29), improvement factors
of 1.8 to 19.0 were achieved by use of the FFT algorithm, verifying that the thousand-fold gains found in other areas are not present in the crystallographic case. For high n a point is reached where the FFT is more efficient. The size of the problem necessary for the FFT algorithm to be considerably faster depends on the efficiency of the implementations of the respective algorithms; i.e., it depends upon the coefficients which multiply the complexity factor to give the cost of the calculation. It is not surprising that the area which is making the most use of the FFT is macromolecular crystallography, where values of n are usually very large.
Considerable work has been done recently on the problems of developing the FFT for crystallographic use. The problem of incorporating space group symmetry has been elegantly treated by Ten Eyck (30) and in a simpler fashion by Bantz and Zwick (31). Other implementations include those of Immirzi (32) and Lange, Stolle and Huttner (33), both of which treat the problem of the enormous amount of computer storage required to store an entire crystallographic map (100,000 to 500,000 points are frequently required), and also the work of Mallinson and Teskey (34), which discusses the problem of handling negative indices economically. In the future the FFT algorithm will be more widely used in small molecule as well as macromolecular crystallography, especially as new efficient FFT programs are integrated into the various program systems used for such calculations.
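The difference in operation count between a direct transform and a finely factored one can be made concrete with a one-dimensional toy example. The sketch below is a textbook radix-2 FFT for complex data whose length is a power of two; it is not one of the crystallographic formulations cited above, and the multiplication counts are the usual leading-order estimates.

```python
import cmath

def dft_direct(x):
    """Direct evaluation: every coefficient sums over every data point."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for j in range(n // 2):
        # Each product is computed once and shared by two output points.
        twiddle = cmath.exp(-2j * cmath.pi * j / n) * odd[j]
        out[j] = even[j] + twiddle
        out[j + n // 2] = even[j] - twiddle
    return out

def multiplies_direct(n):
    """Complex multiplications in the direct sum."""
    return n * n

def multiplies_fft(n):
    """Complex multiplications in the radix-2 FFT, n a power of two."""
    return (n // 2) * (n.bit_length() - 1)

sample = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
```

The actual crossover point between methods depends on the constants multiplying these counts, as the text notes.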
In practice, a good general purpose program (especially efficient for small molecule crystallography) could be developed by combining the strengths of the FFT and trigonometric techniques. The crystallographic Fourier transform, whether it be done by FFT or otherwise, can be factored into three parts: a "first dimension" in which summation is made over the direction normal to the sections of the three-dimensional map, and a second and third dimension in the plane of the map sections. A computer formulation of the trigonometric triple product technique which incorporates the space group symmetry almost exclusively in the first dimension of the calculation is available (35). A program which performs the first dimension calculation in the traditional space-group specific manner, and performs the second and third dimensions by the FFT algorithm, would have several advantages. It would make efficient use of the fact that in most crystallographic Fourier calculations there are 10 to 20 times more calculated grid points than input data, without having to resort to less efficient formulations of the FFT algorithm which require complex multiplication. It would greatly alleviate the storage problem, and would remove most of the symmetry considerations from the FFT portion of the calculation, leading to a simpler implementation at the innermost part of the calculation.
This proposed program bears some similarity to the work of Immirzi (32), where the FFT was not used in the first dimension because of storage considerations, but where symmetry was avoided by transforming the data
to triclinic. In the limit of high n, the proposed program would of necessity be slower than an all FFT program. In the case of small molecule maps, where the ratio of grid points to data is especially high, this program would be most efficient and, if done right, considerably more efficient than an all FFT implementation.
Direct Methods:
Direct methods is the most widely used technique for getting a trial structure in small molecule crystallography, and has increasing applications in macromolecular crystallography as well (36). The problem is one of finding a set of approximate phases $\phi_{\mathbf{h}}$ to assign to the observed normalized structure factor magnitudes $|E_{\mathbf{h}}|$ so that a Fourier transform calculation can be performed to give an electron density map from which atomic positions can be derived. Most computer programs for direct methods are based on the $\Sigma_2$ formula (37, 38) and the tangent formula (38), both of which relate phases by equations which have calculated probabilities of being correct. The phases related in both cases are those of reflection triples for which
$$\mathbf{h} + \mathbf{k} + \mathbf{l} = 0 \tag{19}$$
where $\mathbf{h}$, $\mathbf{k}$, and $\mathbf{l}$ are vectors whose components are the integer indices of the reflections which have large |E|. The algorithm used to search for these triples is of primary importance to the efficiency of most direct methods computer programs. The set of high |E| reflections usually comprises 0.1 to 0.3 of the symmetry independent reflections. In the search, all the symmetry related reflections must be used for two of the reflections; in orthorhombic, for example, the symmetry independent set must be expanded 8-fold either prior to the calculation or during each test. The obvious three-loop way of finding triples leads to an $n^3$ complexity algorithm (and a lot of wasted computer time). This can be changed to an $n^2$ complexity procedure if each reflection is associated uniquely with an array subscript by some equation involving the integer indices, so that given h and k, the subscript of l can be calculated and the presence of $E_{\mathbf{l}}$ in the set can be checked by table lookup. Perhaps the most efficient algorithm (used in several programs, including the program DIREC written by the author) is one originally developed by Dewar for the MAGIC program (39). Prior to the searching operation, the set of high |E| reflections is expanded to the full set of reflections, and the h vectors are transformed into a set of real integers $\{m_i\}$ in such a way as to preserve the arithmetic relationship among the h. One such mapping is
$$m = 1000000\, h_1 + 1000\, h_2 + h_3 \tag{20}$$
where the components of the vector $\mathbf{h}$ are $h_1$, $h_2$, $h_3$. Since the range of possible values of $h_1$, $h_2$, and $h_3$ is restricted, if equation (19) holds, the m values derived from the three vectors will also sum to zero, and vice versa. Next the $m_i$ are sorted numerically with elimination of duplicates from the symmetry expansion. During these operations a pointer back to the original reflection and a symmetry operation code are carried along with each $m_i$. The process of finding all k and l which form triples with h is thus transformed to the problem of finding all pairs of integers from the ordered set $\{m_i\}$ which sum to -n, where n is the "m value" of h. The transformed problem has a very efficient solution involving only one pass through $\{m_i\}$. Two pointers (i and j) are initialized to point at the beginning and at the end of the set respectively; all triples are found by moving i and j toward each other until they meet, using the procedure diagrammed in Figure 5. The simplicity of this procedure can readily be appreciated if the reader will construct an ordered array of 10 to 15 integers (in the range -20 to 20), and follow the algorithm to find pairs which sum to a given value. Alternatively the pointers could be started at an interior point of the set, as favored by Dewar (39), and moved outward in a linear sweep using a similar procedure.
The algorithm described above for finding triples may be extended to find higher order relationships, for example, the quartets (four vectors h, k, l, and m which sum to zero) for which new powerful formulas are being developed by Hauptman (40).
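The integer mapping and the single-pass pair search can be sketched as follows. This is the standard two-pointer search over a sorted list, written from the verbal description; it is not a transcription of Figure 5 or of the MAGIC or DIREC code. The function names are illustrative, symmetry operation codes are omitted, and the example assumes every index satisfies |h| <= 333 so that sums of three indices cannot carry between the packed fields.

```python
def encode(h):
    """Pack the three integer indices of a reflection into one integer."""
    return 1000000 * h[0] + 1000 * h[1] + h[2]

def pairs_with_sum(sorted_m, target):
    """All pairs of distinct members of an ascending, duplicate-free list
    that add up to target, found in one pass with two pointers.
    (Using i <= j instead would also allow a member paired with itself.)"""
    i, j = 0, len(sorted_m) - 1
    found = []
    while i < j:
        total = sorted_m[i] + sorted_m[j]
        if total == target:
            found.append((sorted_m[i], sorted_m[j]))
            i += 1
            j -= 1
        elif total < target:
            i += 1          # sum too small: advance the low pointer
        else:
            j -= 1          # sum too large: retreat the high pointer
    return found

def find_triples(h, reflections):
    """Pairs (k, l) from the expanded reflection list with h + k + l = 0."""
    by_code = {encode(v): v for v in reflections}   # pointer back to each reflection
    codes = sorted(by_code)                         # sorted, duplicates removed
    return [(by_code[a], by_code[b])
            for a, b in pairs_with_sum(codes, -encode(h))]

expanded = [(1, 2, 3), (2, -1, 0), (-3, -1, -3), (0, 1, 1), (-1, -3, -4), (4, 0, 0)]
triples = find_triples((1, 2, 3), expanded)
```

Each call makes one pass through the sorted list, so finding the triples for all n reflections costs on the order of $n^2$ steps after a single sort.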
However, simple extension of this algorithm does not appear to be optimal, and more research in this area is needed.
When the phase relationships and their probabilities have been derived, several thousand inconsistent equations in a few hundred unknowns must be translated into a set (or sets) of phases. The procedures used for this are very interesting, but too specific to crystallography to be discussed in detail here. One or more specially chosen phases (depending on the space group) may be assigned "free" to fix the degrees of freedom. Next the set of known phases usually is extended by: 1) symbolic addition (41), wherein symbols of unknown value are assigned to a few selected reflections, and the set is extended by algebraic manipulations which assign phases as linear combinations of symbols; or 2) the multi-solution method (42), wherein all combinations of possible phase values for a few reflections are carried through the extension to give a number of possible phase sets. The next step is to rank the phase sets which result from the multi-solution method, or from the assignment of numeric phases to the symbols used in the symbolic addition method; no foolproof way to do this has yet been found. Frequently several, sometimes many, sets of phases must be tried before a trial structure is obtained. With enough perseverance, however, a trial structure can almost always be obtained by direct methods using presently available programs. New theoretical developments in direct methods hold promise for improved, more automatic computer programs for determining
Figure 5. Procedure for finding all pairs of integers with a given sum
starting phase sets.
Molecular Mechanics "Strain Energy" Calculations:
Since molecular mechanics "strain energy" calculations (43, 44) have become a valuable tool in interpretation of molecular structure results from crystallographic studies, certain computing techniques used there will be mentioned. The method is simple in principle; the strain energy of a particular conformation of a molecule is expressed as the sum of terms of several types, each related to certain structural parameters; for example, bond length, non-bonded contacts, torsion angle.
$$E_{\text{strain}} = E_{\text{bond}} + E_{\text{angle}} + E_{\text{torsion}} + \cdots \tag{21}$$
Each term is a simple equation involving one or more empirically derived potential parameters and one or more structural parameters. In the usual calculation, the structural parameters are varied to minimize the strain energy, the potential parameters being held fixed. Crystal structure results are sometimes used to derive potential parameters (45, 46).
In most studies of molecular structure starting from crystallographic results, it is useful to calculate the minimum energy for the molecule in the crystal. Usually the molecule may be surrounded by its nearest neighbors in the crystal, and the minimization may be carried out by holding the unit cell parameters fixed and varying the atomic positions, with preservation of space group symmetry. This simple method will produce good results (provided suitable potential parameters are used) if calculation of the minimum energy molecular conformation is desired. It will not suffice if the unit cell parameters are to be varied, if intermolecular potential parameters are to be varied, or if accurate lattice energies are to be calculated. For these purposes lattice sums should be evaluated; a particularly efficient method for doing this is the convergence acceleration algorithm of Williams (47).
In our experience, the introduction of "extra potentials" is a particularly useful technique when molecular conformations other than the minimum energy one must be explored.
In this method, potentials are added which make it prohibitively expensive (in energy terms) for the molecule not to assume the desired structural feature. The total energy--strain energy plus "extra potential" energy--is minimized, giving the minimum energy conformation of the molecule subject to the constraint imposed by the "extra potentials".
$$E_{\text{total}} = E_{\text{strain}} + E_{\text{extra}}$$
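A one-variable toy model shows how an "extra potential" steers a minimization. Everything numerical here is an assumption made for illustration: the strain energy is a threefold torsion term with an arbitrary barrier height, the extra potential is a stiff harmonic restraint on the torsion angle, and a golden-section search stands in for whatever minimizer a real program would use.

```python
import math

BARRIER = 3.0        # hypothetical torsional barrier height, arbitrary units
STIFFNESS = 1000.0   # hypothetical restraint constant, per radian squared

def strain(phi):
    """Toy strain energy: threefold torsion term, minima at 60, 180, 300 deg."""
    return 0.5 * BARRIER * (1.0 + math.cos(3.0 * phi))

def extra(phi, phi_target):
    """Extra potential that makes leaving phi_target very expensive."""
    return STIFFNESS * (phi - phi_target) ** 2

def golden_minimum(func, lo, hi, tol=1e-10):
    """Golden-section search for the minimum of a unimodal function."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    while abs(b - a) > tol:
        if func(c) < func(d):
            b = d
        else:
            a = c
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
    return 0.5 * (a + b)

def constrained_minimum(phi_target):
    """Minimize strain plus extra potential near the requested torsion angle."""
    def total(phi):
        return strain(phi) + extra(phi, phi_target)
    return golden_minimum(total, phi_target - 0.5, phi_target + 0.5)

target = math.radians(100.0)
phi_c = constrained_minimum(target)
# Strain energy of the restrained conformation relative to the free minimum.
cost = strain(phi_c) - strain(math.radians(60.0))
```

Only the strain part of the total energy is compared with the unrestrained minimum; the restraint energy itself is bookkeeping and is left out of the comparison.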
By subtracting the strain energy of the molecule in its minimum energy conformation from the strain energy portion of the total energy, the cost of assuming the non-minimal conformation may be assessed. Many properties of molecules may be conveniently studied by this technique, including: flexibility of the molecule with respect to a certain torsion angle, barriers between conformational minima, and the feasibility of certain conformations predicted to be "active". One application we have found especially useful is the matching of two molecules which are presumed to bind at the same active site. In this procedure (see Figure 6), two or more molecules are minimized simultaneously while being linked at certain selected sites by "extra potentials". One word of caution is appropriate here--"extra potentials" are usually set to be so strong that, without due care, the calculation may become unbalanced, causing certain minimization techniques to converge quite slowly. This is particularly true of certain "pattern search" routines (48) used in many programs.
Minimization techniques are of great importance to both the efficiency of molecular mechanics computer programs, and the accuracy and reproducibility of the results. The energy expression is non-linear in the variables used in the calculation. If, as is usual, atomic coordinates are the variables, the number of variables is greater than the number of degrees of freedom.
The energy surface is characterized by many local minima, and by the fact that a minimum is frequently quite flat for considerable distances in parameter space. An optimal minimization algorithm for such problems is yet to be discovered. Methods currently used include search techniques, which converge from large distances but are inefficient in flat minima, and more complicated methods such as Newton's Method, which works well in finding the minimum but is extremely time consuming if the initial starting point is far off.
Automating Crystallographic Calculations:
During the course of a crystal structure determination a large number of different types of calculations must be performed. Prior to the advent of crystallographic computing systems, each type was incorporated into a different program with its own peculiar form of input and output. With the advent of programming systems (49, 50) most of the incompatibility between programs, and much of the tedium of crystallographic computing, was eliminated--how much so depends upon the particular system. A reasonable set of goals to strive for in automating a computing process is:
a) single entry of data
b) minimization of input, including providing defaults for all options and not requiring entry of anything the computer can calculate
c) minimization of input errors
d) computer runs until a "human" decision is needed
e) minimum effort for a decision
f) minimum effort to implement decisions
For example, if the crystallographic unit cell parameters are entered during data reduction calculations, they should not have to be entered again in any subsequent calculation. One of the best examples of not minimizing input is a computer program which requires the user to enter the number of atoms to be entered, instead of counting the atoms as they are entered. Good input engineering, including the use of alphabetic labels and free format where appropriate, will minimize input errors. Factors which can minimize decision making effort include organizing the data pertinent to the decision in a short summary form and presenting it in a way that can be quickly assimilated by the user. After the requisite decisions are made, we can't say to the computer "continue with calculation x", but we should strive to come as close as possible to this.
There are problems complicating this automation process; some are computer engineering, some practical, and some basically philosophical. These include: the necessity for retaining optional ways of doing the calculations, the need for the user to retain control of the process, the restrictions placed on operation by the various computer systems, avoiding the waste of computer time, and the inherent difficulty usually encountered in automating decision making.
A certain level of automation of the decision making and decision implementation processes has been achieved in our laboratory through use of a graphics terminal on-line to our large research computer (21).
Figure 7 shows the operational hookup. Our graphics programs run in a high-priority partition in what is essentially a batch processing system. On-line disk libraries are used to pass data between our graphics programs and our regular batch calculations, which run at a lower priority. All our batch jobs are submitted through the graphics terminal, including the job which transfers the initial data from the laboratory automation computer to the large computer. Any time-consuming calculations are run in batch mode. For example, electron density maps are calculated in a batch run, with the results being saved in a disk library; a graphics program is used for interpretation of the map, since a "human" decision is usually required. The use of this graphics terminal has cut the amount of people time required to run a series of crystallographic calculations by more than a factor of two.

In the area of input engineering, in the current version of the CRYM system (developed by the author), excluding the job control, 15 input records (card images) are required in one batch run to take an initial set of data through a variety of data reduction calculations, approximate scaling, a direct methods
Figure 6. Two steroids, 19-nor androstenediol (a) and 7-ame-19-nor androstenediol (b), as found in crystal (viewed normal to C ring). Dotted lines in (c) show a possible placement of "extra potentials" for linking the molecules during simultaneous strain energy minimization.

Figure 7. Graphics system for crystallographic computing at Upjohn. [Diagram: scope terminal (X-ray lab); graphics partition (high priority) on the IBM 370/155 (computer center); disk libraries.]
calculation, and calculation of the most probable E-map ready for interpretation on the graphics terminal. Analysis of this input shows that for the case of the morphine free base structure it could be reduced to the four records shown below for the computation of the 4 most probable maps.

    DATA REDUCTION (MORPHINE), SPACE GROUP = 19
    ASYMMETRIC UNIT C17 H21 O4
    DIRECT METHODS
    EMAP, 1-4, (MORPEMC*)

By use of suitable abbreviations, a shorter form is possible.

    DR(MORPHINE),SG=19
    AU C17 H21 O4
    DM
    EM,1-4,(MORPEMC*)

Our system does not have this type of input, but it illustrates the direction we are headed. It is a worthwhile direction for any system of programs with a long lifetime.

ABSTRACT

This review presents a selection of newer algorithms used in X-ray crystallographic calculations. Some of the material is not previously published. Areas discussed in detail include: algorithm design for computer-controlled diffractometers, a scheme for computer-aided alignment of X-ray tubes, a procedure for determining precision unit cell parameters, a method for scaling intensity data for crystal deterioration, "hidden line" algorithms for drawing crystals from face descriptions and for drawing ball and stick molecules, crystallographic use of the "fast Fourier transform" method, use of "extra potentials" in molecular mechanics, and the total automation of the X-ray computing process.

Literature Cited

1. "International Tables for X-ray Crystallography", Vol. I, II, III, IV, Kynoch Press, Birmingham.
2. Rollett, J. S., "Computing Methods in Crystallography", Pergamon Press, Oxford (1965).
3. Stout, G. H. and Jensen, L. H., "X-ray Structure Determination", Macmillan, New York (1968).
4. Abramowitz, M. and Stegun, I. A., Editors, "Handbook of Mathematical Functions", National Bureau of Standards, Government Printing Office, Washington (1964).
5. Aho, A. V., Hopcroft, J. E., and Ullman, J. D., "The Design and Analysis of Computer Algorithms", Addison-Wesley, Reading, Massachusetts (1974).
6. Furnas, T. C., Jr., "Single Crystal Orienter Instruction Manual", General Electric Company, X-ray Department, Milwaukee (1957).
7. Blessing, R. H., Coppens, P., and Becker, P., J. Applied Crystallography, (1972), 7, 488.
8. Lehmann, M. S. and Larsen, F. K., Acta Cryst., (1974), A29, 216.
9. Lehmann, M. S., J. Applied Crystallography, (1975), 8, 619.
10. Mackenzie, J. K. and Williams, E. J., Acta Cryst., (1973), A29, 201.
11. Killean, R. C. G., Acta Cryst., (1973), A29, 216.
12. Grant, D. F., Acta Cryst., (1973), A29, 217.
13. Spinrad, R. J., "Abstracts of the American Crystallographic Association", Clemson, Series 2, 4, 35, (1976).
14. Sparks, R. A., "Abstracts of the American Crystallographic Association", Ottawa, (1970).
15. Jacobson, R. A., J. Applied Crystallography, (1976), 9, 115.
16. Original reference unknown; the author first encountered this method in a class on advanced X-ray crystallography taught by R. E. Marsh at Caltech.
17. Ref. 1, Vol. II, p. 106.
18. Abrahams, S. C., Acta Cryst., (1973), A29, 111.
19. See for example, L. G. Kelly, "Handbook of Numerical Methods and Applications", p. 66, Addison-Wesley, Reading, Massachusetts (1967), and the references therein.
20. Sutherland, I. E., Sproull, R. F., and Schumacker, R. A., Computing Surveys, (1974), 6, 1.
21. Duchamp, D. J., "Abstracts of American Crystallographic Association", Clemson, Series 2, 4, 20, (1976).
22. Busing, W. R. and Levy, H. A., Acta Cryst., (1957), 10, 180.
23. Okaya, Y., IBM Research Report, R.C. 1706, IBM Watson Research Center, Yorktown Heights, N.Y., (1966).
24. Johnson, C. K., ORTEP, ORNL-3794, Oak Ridge National Laboratory, Oak Ridge, Tennessee (1965).
25. Reference 5, p. 60.
26. Cooley, J. W. and Tukey, J. W., Math. Comput., (1965), 19, 297.
27. Gentleman, W. M. and Sande, G., Proceedings of the Fall Joint Computer Conference, (1966), 563.
28. See reference 1, Vol. II, p. 78, for example.
29. Hubbard, C. R., Quicksall, C. O., and Jacobson, R. A., J. Applied Crystallography, (1972), 5, 234.
30. Ten Eyck, L. F., Acta Cryst., (1973), A29, 183.
31. Bantz, D. A. and Zwick, M., Acta Cryst., (1974), A30, 257.
32. Immirzi, A., J. Applied Crystallography, (1973), 6, 246.
33. Lange, S., Stolle, U., and Huttner, G., Acta Cryst., (1973), A29, 445.
34. Mallinson, P. R. and Teskey, F. N., Acta Cryst., (1974), A30, 601.
35. Duchamp, D. J., Thesis, California Institute of Technology, p. 82 (1965).
36. Sayre, D., Acta Cryst., (1974), A30, 180.
37. Karle, J. and Hauptman, H., "Solution of the Phase Problem. I. The Centrosymmetric Case", Am. Cryst. Assoc. Monograph No. 3, (1953).
38. Karle, J. and Hauptman, H., Acta Cryst., (1958), 11, 264.
39. Coulter, C. L. and Dewar, R. B. K., Acta Cryst., (1971), B27, 1730.
40. Hauptman, H., Acta Cryst., (1975), A31, 680.
41. Karle, J. and Karle, I. L., Acta Cryst., (1966), 21, 849.
42. Germain, G., Main, P., and Woolfson, M. M., Acta Cryst., (1970), B26, 274.
43. Kitaigorodsky, A. I., "Molecular Crystals and Molecules", Chapter VII, Academic Press, New York (1973).
44. Engler, E. M., Andose, J. D., and von R. Schleyer, P., JACS, (1973), 95, 8005.
45. Williams, D. E., Acta Cryst., (1974), A30, 71.
46. Ermer, O. and Lifson, S., JACS, (1973), 95, 4121.
47. Williams, D. E., Trans. Amer. Cryst. Assoc., (1970), 6, 21.
48. See for example, R. Hooke and T. A. Jeeves, J. Assoc. Computing Machinery, (1961), 8, 212.
49. Duchamp, D. J., "Abstracts American Crystallographic Association", Bozeman, Montana, 29 (1964).
50. Stewart, J. M. and High, D. F., ibid.
6
Algorithms in the Computer Handling of Chemical Information
LOUIS J. O'KORN
Systems Development Dept., Chemical Abstracts Service, The Ohio State University, Columbus, OH 43210
Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0046.ch006
The chemical literature emphasizes the detailed structural characteristics of chemical substances; this paper addresses computer-based algorithms that support the handling of information about chemical substances. The nature of problems requiring an algorithmic solution, examples of specific algorithms to support these solutions, and some of the continuing problems are discussed. Since representation affects the nature of algorithms, several of the computer representations of a chemical substance are mentioned. For these representations, algorithm developments that perform interconversion, registration, and structure searching are discussed.

Introduction

The techniques utilized in chemical information handling systems fall into two categories: those which handle the processing of text and those concerned with the processing of chemical substance information. The general text handling processes in chemical information handling systems are not substantially different from the processes of information handling systems for other scientific disciplines. Although not discussed here, substantial progress has occurred in the development of computer-based algorithms for text information handling systems. These computer-based text information handling systems provide for data base compilation to support traditional printed publication and also the selective dissemination of the information. Algorithm development in the areas of computer editing, data base management, sorting, computer-based composition, and text searching has been critical to the overall development of computer-based primary and secondary publications systems and text search services. Results of these developments are illustrated in the computer-based information system used at Chemical Abstracts Service (CAS) [1]. Lynch [2] describes principles and techniques for the computer-based information services and
Cuadra [3] provides annual reviews of developments in information handling.

It is the set of methods for representing, sorting, manipulating and retrieving information about chemical substances that distinguishes the techniques of chemical information handling from those of other disciplines. Chemical literature emphasizes the detailed structural characteristics of chemical substances. This is illustrated by the fact that for the 392,000 documents abstracted in 1975 in CHEMICAL ABSTRACTS, 1,514,000 chemical substance index entries were generated. Of these chemical substance index entries, 368,000 corresponded to substances which were reported for the first time in 1975. This paper addresses the computer-based algorithms that support the handling of chemical substance information.

Since the methods used to represent information about chemical substances are critical to the nature of the algorithms used, a variety of chemical substance representation systems are presented, along with the various system processes necessary to handle computer-based files of chemical substance information. The algorithm developments that support these system processes are summarized, and sample algorithms are provided in the appendix to illustrate supporting system processes in areas of registration, substructure searching, and interconversions. Lynch and others [4] provide an overview of principles and techniques for computer handling of information on chemical substances, and the characteristics of information handling systems utilizing these principles and techniques.
Representations of Chemical Substance Information

Chemical structure diagrams are two-dimensional visual descriptions of a chemical substance and provide an important medium for communications between chemists. Employing conventions for representing the three-dimensional structural features in the plane, these structure diagrams fall short of describing geometrical reality but they are the accepted way to describe chemical substances. Because structural diagrams are difficult to convey both orally and in written text, several other representation systems have been developed. Many of these chemical substance representation systems were developed prior to, but have been utilized in, computer-based chemical substance information handling systems. In addition, several representation systems more amenable to algorithmic computer processing have been developed.

For input, storage, manipulation, and output within computer-based systems, a representation of the chemical substance must be selected. The selection of a particular representation scheme for an information system is based on the size of the files to which it applies, the functions to be performed, the available hardware and software, and the desired balance between
manual and machine processes. The substance representation system is critical to the nature of algorithms in computer-based chemical substance information handling systems.

Not all representations are of equivalent descriptive power. Two important characteristics of a representation are unambiguity and uniqueness. A representation is unique if, upon applying the rules of the system to a chemical substance, only one representation can be derived. A representation is unambiguous if the representation applies to only one chemical substance, although there may be more than one possible representation for each chemical substance. For example, in Figure 1a, the systematic name provides a unique, unambiguous representation. The molecular formula, Figure 1b, is a unique but ambiguous representation; unique because for any chemical substance there is only one molecular formula, but ambiguous because isomers also have this molecular formula. The arbitrarily numbered connection table, Figure 1c, provides a non-unique, unambiguous representation. The representation is unambiguous since it corresponds to one and only one substance, but it is not unique because alternative numberings of the connection table would result in different representations for the same chemical substance (the connection table representation is discussed in more detail below).
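The distinction between uniqueness and unambiguity can be made concrete with a small sketch. It assumes a simplified connection table (a tuple of element symbols plus a set of bonded atom-number pairs, with bond orders and hydrogens left out), and the helper names `heavy_atom_formula` and `renumber` are hypothetical. The first table follows the numbering of 1,4-dichlorobenzene in Figure 1c.

```python
from collections import Counter


def heavy_atom_formula(elements):
    """Formula over the non-hydrogen atoms: carbon first, then alphabetical."""
    counts = Counter(elements)
    order = sorted(counts, key=lambda e: (e != "C", e))
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


def renumber(table, new_number):
    """Apply a different atom numbering to the same structure."""
    elements, bonds = table
    new_elements = [None] * len(elements)
    for old, symbol in enumerate(elements, start=1):
        new_elements[new_number[old] - 1] = symbol
    new_bonds = frozenset(
        tuple(sorted((new_number[a], new_number[b]))) for a, b in bonds
    )
    return tuple(new_elements), new_bonds


ELEMENTS = ("Cl", "C", "C", "C", "C", "C", "C", "Cl")
para_a = (ELEMENTS, frozenset(
    {(1, 2), (2, 3), (2, 7), (3, 4), (4, 5), (5, 6), (5, 8), (6, 7)}))
para_b = renumber(para_a, {old: 9 - old for old in range(1, 9)})
ortho = (ELEMENTS, frozenset(
    {(1, 2), (2, 3), (2, 7), (3, 4), (3, 8), (4, 5), (5, 6), (6, 7)}))

# Non-unique: one substance, two different connection tables.
tables_differ = para_a != para_b
# Unique but ambiguous: every table, including the isomer, gives one formula.
formulas = {heavy_atom_formula(t[0]) for t in (para_a, para_b, ortho)}
```

Here `tables_differ` is true although `para_a` and `para_b` describe the same substance, while `formulas` collapses to a single entry even though `ortho` is a different substance.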
In addition to being categorized according to their uniqueness and ambiguity, chemical substance representations commonly used within computer-based systems can be further classified as systematic nomenclature, fragment codes, linear notations, connection tables, and coordinate representations.

Systematic Nomenclature. Systematic nomenclature provides a unique, unambiguous representation of a chemical substance by the application of a rigorous set of systematic nomenclature rules. A representation of a chemical substance is constructed by applying these nomenclature rules to combine terms which describe the individual rings, chains, and functional groups within the chemical substance. Chemical nomenclature provides a representation which can be interpreted directly by the practicing chemist, is generally suitable for oral discourse, can be used in a printed index, and is increasingly available in computer-readable files. Davis and Rush [5, Chapter 8] describe the origin, development, and examples of systematic nomenclature systems.

Figure 2 provides an example of systematic nomenclature utilizing the CHEMICAL ABSTRACTS NINTH COLLECTIVE INDEX Nomenclature Rules [6]. The systematic name in this example is cyclohexanol, 2-chloro-. It is generated by (1) determining the principal functional group, the OH group; (2) determining the ring or chain to which it is directly attached, cyclohexane; (3) naming the functional group and its attached ring, cyclohexanol; and (4) naming all other functional groups and skeletal fragments, 2-chloro, where the locant 2 identifies the point of
attachment to the cyclohexane ring.

Fragment Codes. Fragment codes are a series of predefined descriptors which are assigned to significant substructural units, e.g., rings or functional groups. A given code is assigned to a chemical substance if the structural component occurs within the chemical substance. Typically, fragment codes provide a unique, ambiguous description of a chemical substance. With the introduction of punched-card systems, fragment code systems became popular because of the simplicity of representation and the ease of the coding and searching operations. Since fragment codes offer only a partial description of a chemical substance based on predefined descriptors, there are situations in which substructural components that were not initially anticipated and defined cannot be searched, and situations in which extraneous structures are retrieved that contain the needed fragments but not in the desired relationships. Although fragment codes are valuable for subclassification of files, in the case of large files, fragment codes are usually accompanied by other, more complete representations. Figure 3 provides an example of a fragment code representation utilizing the Ring Code System [7], with codes corresponding to the card columns and punches for the particular characteristic cited.

Linear Notation. Linear notation systems use a linear string consisting of a set of symbols to represent complete topological descriptions of chemical substances. Each system has symbols which represent atoms or groups of atoms, a syntax to describe interconnections, and rules for ordering the symbols to provide a unique and unambiguous representation of the topology of a chemical substance. Once derived by applying a set of ordering rules, linear notations are easy to input and require no specialized input equipment. The representation is very compact and the file structure is simple; also, linear notations can be utilized in printed indexes. Davis and Rush [5, Chapter 9] provide general information on linear notation systems and a more detailed discussion of the origin and development of the IUPAC, Wiswesser, Hayward, and Skolnik linear notation systems.

Figure 4 provides an example of a representation using Wiswesser Line Notation. For this example, the Wiswesser Line Notation is L6TJ AQ BG. The ring system is cited first and is represented by L6TJ, where L indicates the start of a carbocyclic ring, 6 indicates a six-member ring, T indicates that the ring is fully saturated, and J indicates the end of the ring system. The substituents Cl and OH are represented by G and Q, respectively, and their positions of attachment are identified by the locants A and B. Since Q occurs later than G in the defined collating sequence, Q is cited before G.
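The reading of L6TJ AQ BG given above can be traced in a few lines of code. This is only a sketch of the handful of symbols explained in the text (L, a ring size, T, J, and the substituent symbols G and Q); it is not a Wiswesser Line Notation parser, and the function name `decode_simple_wln` and its output layout are hypothetical.

```python
# Only the two substituent symbols explained in the text are known here.
SUBSTITUENTS = {"G": "Cl", "Q": "OH"}


def decode_simple_wln(notation):
    """Decode notations of the form 'L<size>[T]J <locant><symbol> ...'."""
    ring, *attachments = notation.split()
    if not (ring.startswith("L") and ring.endswith("J")):
        raise ValueError("only a single carbocyclic ring L...J is handled")
    body = ring[1:-1]                    # e.g. "6T"
    saturated = body.endswith("T")       # T: ring is fully saturated
    ring_size = int(body.rstrip("T"))    # 6: six-member ring
    substituents = {}
    for token in attachments:            # e.g. "AQ": locant A, symbol Q
        locant, symbol = token[0], token[1:]
        substituents[locant] = SUBSTITUENTS[symbol]
    return {"ring_size": ring_size, "saturated": saturated,
            "substituents": substituents}


decoded = decode_simple_wln("L6TJ AQ BG")
```

For the example notation, `decoded` records a saturated six-member ring with OH at locant A and Cl at locant B, matching the description of cyclohexanol, 2-chloro-.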
Figure 1. Various representations of the chemical substance

a.) Systematic Nomenclature: Benzene, 1,4-dichloro-
b.) Molecular Formula: C6H4Cl2
c.) Connection Table:

    Atom No.   Element   Bonds    Connections
    1          Cl        S        2
    2          C         S,S,D    1,3,7
    3          C         S,D      2,4
    4          C         D,S      3,5
    5          C         S,D,S    4,6,8
    6          C         D,S      5,7
    7          C         D,S      2,6
    8          Cl        S        5

Figure 2. Representation using systematic nomenclature: cyclohexane, chloro, and ol combine to give Cyclohexanol, 2-chloro-

Figure 3. Representation using fragment codes

    Code    Characteristic
    2/12    One Isolated Ring
    4/1     One 6-member Fully Saturated Carbocyclic Ring
    17/1    Chlorine Present
    18/1    One OH

Figure 4. Representation using Wiswesser line notation: L6TJ AQ BG
Connection Tables. A structure diagram of a chemical substance can be viewed as a graph with the nodes corresponding to the non-hydrogen atoms of the substance and the edges connecting the nodes corresponding to the bonds of the substance. Given an arbitrary numbering of the non-hydrogen nodes of the graph, the connection table is a tabular description of the graph in which each node is listed in numerical order and described by its element symbol, and the interconnections of each atom with each other atom are explicitly described. Structural details such as charge, abnormal valency, and isotopic mass can be recorded with each atom. Beyond the atoms and bonds, the connection table introduces no concepts of chemical significance into the representation. Consequently, connection tables can be input by clerical staff with little training.

Figure 5 provides an example of a connection table. Since all interconnections are cited twice, this form is called a redundant connection table. By numbering the atoms of a structure such that once an atom has been numbered, all un-numbered atoms directly connected to it are numbered, and by citing only connections to lower-numbered atoms, a more compact connection table can be derived. Figure 6 provides an example of a compact connection table. Since the interconnection between Atom 7 and Atom 8 has not been cited, these attachments, which complete the description of the interconnections of the structure, are cited in a field called the ring closure list. Dittmar, Stobaugh, and Watson [8] describe the connection table utilized in the CAS Chemical Registry System.
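The step from a redundant to a compact connection table can be sketched as follows. The sketch assumes a connected structure, ignores bond orders, and uses a breadth-first numbering as one way of meeting the stated rule that all un-numbered neighbours of a numbered atom are numbered next; the function name `compact_table` and the data layout are hypothetical and are not the CAS Registry format. The input is the 1,4-dichlorobenzene table of Figure 1c rather than the structure of Figures 5 and 6.

```python
from collections import deque


def compact_table(elements, neighbors, start=1):
    """Renumber breadth-first; cite one lower-numbered attachment per atom."""
    new_number = {start: 1}      # old atom number -> new atom number
    attached_to = {}             # new number -> lower-numbered atom it joins
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nbr in neighbors[atom]:
            if nbr not in new_number:
                new_number[nbr] = len(new_number) + 1
                attached_to[new_number[nbr]] = new_number[atom]
                queue.append(nbr)
    rows = [(new, elements[old], attached_to.get(new))
            for old, new in sorted(new_number.items(), key=lambda item: item[1])]
    all_bonds = {tuple(sorted((new_number[a], new_number[b])))
                 for a in neighbors for b in neighbors[a]}
    cited = {(low, high) for high, low in attached_to.items()}
    ring_closures = sorted(all_bonds - cited)   # bonds not yet cited
    return rows, ring_closures


elements = {1: "Cl", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "C", 8: "Cl"}
neighbors = {1: [2], 2: [1, 3, 7], 3: [2, 4], 4: [3, 5],
             5: [4, 6, 8], 6: [5, 7], 7: [2, 6], 8: [5]}
rows, ring_closures = compact_table(elements, neighbors)
```

Each row cites a single attachment instead of the full neighbour list, and the one bond that closes the six-member ring is left for the ring closure list.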
Lefkowitz [9] describes a concise form of a connection table, called the Mechanical Chemical Code, which does not explicitly identify the bonds and has attributes of both a connection table and linear notation. The DARC code [10] resembles a connection table, since it expresses or implies the nature of each atom and bond, but it is generated in a concise, linear form. The description begins with one atom which is chosen as the "focus" of the structure and then proceeds outward, describing the "environment" of the "focus."

Coordinate Representation. A coordinate representation of a chemical substance is a recording of the atoms and bonds of that substance with an indication of their relative position in a plane. This coordinate representation provides a valuable form to facilitate on-line, real-time manipulation of the structure diagram and to store the diagram for subsequent composition in journals, handbooks, and search output. Because this representation is difficult to manipulate, it is typically converted to some other form for other information system functions. Farmer and Schehr [11] describe the approaches and capabilities used at CAS for representing and processing a coordinate form of structure diagrams.
Figure 7 graphically shows a coordinate representation of the chemical structure diagram. Every identifiable substructural unit has a node, drawn with its own symbol in the figure, associated with it. The node corresponding to the complete structure diagram is the root node and is the origin of the coordinate system. Every atom (including implied carbons) and bond of the structure has a leaf, also drawn with its own symbol, associated with it. In the structure diagram, a leaf contains the characters for the element symbols or the line definitions for the bonds and their coordinates to indicate the position in the plane. Coordinate data for a leaf or node are relative to its parent node. Thus it is possible to change the coordinates of an entire subtree by changing the coordinates of the parent.

Processes

The ability to identify and collect all information about a particular chemical substance at one point is essential to computer-based chemical information handling systems. This eliminates the redundancy of work, e.g., in biological testing; it permits effective indexing of chemical substance information; and it allows one to determine if a substance has been previously synthesized. The data base resulting from these processes can also be utilized for the identification of those substances with common structural characteristics.
With the variety of chemical substance representation systems, the ability to interconvert between representations allows flexibility in performing system functions and permits the interchange of information among various chemical substance information handling systems. The system processes and algorithm development to support these processes are described below.

Registration. The registration of a chemical substance is the set of data management procedures which enables all information relating to a specific chemical substance to be linked together. The registration procedure is concerned with determining if a potentially new substance is equivalent to a substance already on file or if it is new, in which case the substance is added to the file. The registration procedure used is determined by whether the structural representation is both unique and unambiguous.

In systems without a unique and unambiguous representation of a chemical substance, the unique and unambiguous identification is accomplished through the registration processes. Initially, the file of substances is partitioned into small groups of substances on the basis of unique and ambiguous characteristics. For a potentially new substance, its unique and ambiguous characteristics are identified and final determination of whether the candidate substance is new is made by direct atom-by-atom structure comparison of the candidate with the subgroup of the
existing substances that have the same characteristics. The selection of characteristics for the partitioning is obviously critical, because the effectiveness of this registration technique is dependent on limiting the size of the subgroups. This technique is called the isomer sort-registration technique. Brown and others [12] describe the Merck, Sharp, and Dohme chemical structure information system which utilizes this approach.

In systems that use a unique and unambiguous representation, determining if a potentially new substance is already on file reduces to the comparison of the unique, unambiguous representation of the candidate substance to the unique, unambiguous representations of the substances previously on file. With linear notations, the unique, unambiguous representation is typically achieved through manual encoding of the chemical substance. Eakin [13] describes the chemical structure information system at Imperial Chemical Industries Ltd., where registration is based on Wiswesser Line Notation. For connection tables, the unique, unambiguous representation is derived automatically, i.e., a single, invariant numbering of the connection table is algorithmically derived. The algorithm used in the CAS Chemical Registry System to generate a unique, unambiguous representation from an arbitrarily numbered connection table [14] is described in a later section.
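The isomer sort-registration technique described above can be sketched under simplifying assumptions. The partitioning characteristic is taken to be the heavy-atom formula, bond orders and hydrogens are ignored, and the atom-by-atom comparison is done by brute force over all atom orderings, which is workable only for very small structures. The names `Registry`, `formula_key`, and `same_structure` are hypothetical, and this is neither the Merck system nor the CAS algorithm.

```python
from collections import Counter, defaultdict
from itertools import permutations


def formula_key(elements):
    """Unique but ambiguous characteristic used to partition the file."""
    return tuple(sorted(Counter(elements).items()))


def same_structure(a, b):
    """Brute-force atom-by-atom comparison of two (elements, bonds) tables."""
    elems_a, bonds_a = a
    elems_b, bonds_b = b
    n = len(elems_a)
    if n != len(elems_b) or len(bonds_a) != len(bonds_b):
        return False
    bonds_b = {frozenset(bond) for bond in bonds_b}
    for perm in permutations(range(n)):      # perm[i]: atom of b matched to i
        if any(elems_a[i] != elems_b[perm[i]] for i in range(n)):
            continue
        if all(frozenset((perm[i], perm[j])) in bonds_b for i, j in bonds_a):
            return True
    return False


class Registry:
    def __init__(self):
        self.groups = defaultdict(list)   # formula key -> [(number, table)]
        self.next_number = 1

    def register(self, table):
        """Return (registry number, True if the substance was new)."""
        group = self.groups[formula_key(table[0])]
        for number, known in group:       # compare within the subgroup only
            if same_structure(table, known):
                return number, False
        number = self.next_number
        self.next_number += 1
        group.append((number, table))
        return number, True


registry = Registry()
propanol_1 = (("C", "C", "C", "O"), {(0, 1), (1, 2), (2, 3)})
propanol_1_renumbered = (("O", "C", "C", "C"), {(0, 1), (1, 2), (2, 3)})
propanol_2 = (("C", "C", "C", "O"), {(0, 1), (1, 2), (1, 3)})
results = [registry.register(t)
           for t in (propanol_1, propanol_1_renumbered, propanol_2)]
```

The renumbered table is recognized as already on file, while the isomer falls into the same subgroup but fails the structure comparison and receives a new number. The cost of the comparison grows rapidly with subgroup size, which is why the choice of partitioning characteristics matters.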
Dittmar, Stobaugh, and Watson [8] provide a description of the general design of the CAS chemical structure information system which utilizes a unique, unambiguous connection table.

Substructure Searching. Registration, as described in the previous section, is a form of full-structure searching. Although the registration process is concerned with determining if a complete structure existed previously within a collection, the data base resulting from the registration processes can be used for other purposes, in particular for substructure searching. Substructure searching is the identification of all substances within a file which contain a given partial structure. Although substantial attention has been given to substructure searching, several problems still remain, particularly in the on-line substructure searching of large files, i.e., those that contain more than a million substances.

With the variety of chemical substance representations, i.e., fragment codes, systematic nomenclature, linear notations, and connection tables, a diversity of approaches and techniques are used for substructure searching. Whereas unique, unambiguous representations are essential for some registration processes, it is important to note that this often cannot be used to advantage in substructure searching. With connection tables, there is no assurance that the atoms cited in the substructure will be cited in the same order as the corresponding atoms in the structure.
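Because the atom order of the query cannot be relied on, a connection-table substructure search has to try assignments of query atoms to structure atoms. The following is a minimal backtracking sketch under simplifying assumptions (element symbols and connectivity only, no bond orders, no screening step); the name `find_substructure` and the table layout are hypothetical, and no particular production search system is implied.

```python
def find_substructure(query, target):
    """Return one mapping {query atom: target atom}, or None if no match."""
    q_elems, q_bonds = query
    t_elems, t_bonds = target
    t_adjacent = {frozenset(bond) for bond in t_bonds}
    q_nbrs = {i: set() for i in range(len(q_elems))}
    for a, b in q_bonds:
        q_nbrs[a].add(b)
        q_nbrs[b].add(a)

    def extend(mapping):
        i = len(mapping)                      # next query atom to place
        if i == len(q_elems):
            return dict(mapping)
        for t in range(len(t_elems)):
            if t in mapping.values() or t_elems[t] != q_elems[i]:
                continue
            # Every already-placed neighbour of i must be bonded to t.
            if all(frozenset((mapping[j], t)) in t_adjacent
                   for j in q_nbrs[i] if j in mapping):
                mapping[i] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[i]                # backtrack
        return None

    return extend({})


query = (("C", "O"), {(0, 1)})                # a C-O fragment
substance_file = {
    "2-propanol": (("C", "C", "C", "O"), {(0, 1), (1, 2), (1, 3)}),
    "propane": (("C", "C", "C"), {(0, 1), (1, 2)}),
}
hits = [name for name, table in substance_file.items()
        if find_substructure(query, table) is not None]
```

In the first structure the search initially places the query carbon on atom 0, finds no oxygen bonded to it, backtracks, and succeeds with atom 1; the query numbering never has to agree with the numbering on file.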
With nomenclature or notation representation systems, a substructural unit may be described by different terms or
130
symbols i n the complete s t r u c t u r e because of the context i n which the s u b s t r u c t u r a l u n i t appears. Fragment code systems, devised to permit r e t r i e v a l of a chemical s t r u c t u r e i n a v a r i e t y o f ways, p r e v i o u s l y u t i l i z e d manually d e r i v e d codes which were stored on and searched from punched cards. With the development o f computer techniques, many o f the e a r l y systems were expanded to permit the storage and search o f a wide v a r i e t y o f more complex codes. The f r a g ments may correspond to general s p e c i f i c or s t r u c t u r a l features and are often organized to allow searching at any l e v e l o f specificity. Search questions are stated i n terms of the f r a g ments used f o r r e p r e s e n t a t i o n and thus r e t r i e v a l s c o n s i s t o f a l l substances c o n t a i n i n g the r e q u i r e d fragments. Because the a d d i t i o n of new s t r u c t u r a l features r e q u i r e s the r e - a n a l y s i s of the p r e v i o u s l y processed f i l e , a t t e n t i o n has been given to the automatic d e r i v a t i o n o f fragment codes from an unambiguous substance r e p r e s e n t a t i o n . The development o f the Gremas fragment code system at I n t e r n a t i o n a l Documentation i n Chemistry [15] was o r i g i n a l l y based on manually derived fragment codes but has subsequently been expanded to generate the codes from connection t a b l e s and t o p o l o g i c a l d e s c r i p t i o n s that have been input by an o p t i c a l scanning d e v i c e . C r a i g [16] d e s c r i b e s the fragment codes r e t r i e v a l system used by Smith, K l i n e , French L a b o r a t o r i e s . 
With the increasing availability of computer-readable files of systematic nomenclature and capabilities for text searching, attention has been given to the development of substructure searching of files of systematic nomenclature using search terms that are also systematic nomenclature terms. Fisanick and others [17] describe an investigation into nomenclature-based substructure searching using techniques and search aids developed at CAS.

Substructure searching based on linear notations can be accomplished in both an automated and a non-automated mode. Dyson [18] describes a computer-produced permuted index that supports the manual searching of the Dyson-IUPAC Linear Notation for substructural components. Computer-based substructure searching of a linear notation involves examining the symbols of the linear notation to determine if the substructural features exist. Granito and Garfield [19] contrast substructure retrieval systems based on fragment codes, connection tables, and linear notations. In addition, they describe applications of Wiswesser Line Notation at the Institute for Scientific Information that support substructure searching, registration, structure/property relationship studies, and display. Lynch and others [4, Chapter 5] describe techniques and considerations for the computer-based searching of linear notations. As with nomenclature substructure searches, the success of a substructure search of linear notation depends directly on the ability of the questioner to anticipate the environment of the required fragment in various structures.
Depending on the sophistication needed, substructure searching can be accomplished with a variety of the representations of a chemical substance. Some substructure searches can only be adequately answered by a complete atom-by-atom and bond-by-bond search, for which a connection table, with its explicit description of full structural detail, is essential.

There are two approaches to the atom-by-atom substructure search of a connection table: iterative atom-by-atom search [20] and the Sussenguth set reduction technique [21]. Because each of these can specify alternative atoms and bonds and alternative subgroups, there is virtually no limit to the degree of generality or specificity of the search.

The iterative atom-by-atom search involves locating the least commonly occurring atom in the substructure and searching for each other atom of the substructure by path-tracing. When a non-match is found, searching is continued by backing up to the most recent branch point and proceeding along another path. This iterative procedure is continued until the substructure is found or the whole structure has been examined without finding the substructure.

The Sussenguth set reduction technique involves partitioning the atoms of both the substructure and the structure based on the atoms, bonds, and interconnections. The technique involves generating subsets of atoms for the structure and subsets of atoms for the substructure, based on the elements, bond values, and number of attachments. For example, all carbon atoms would be in the same subset, all atoms with single bonds attached would be in the same subset, etc.
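The first subset-generation step just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of reference [21]: the function name `initial_partition`, the input layout (atom number mapped to an element symbol and a tuple of bond values), and the three-atom example are all hypothetical.

```python
from collections import defaultdict


def initial_partition(atoms):
    """Group atoms into subsets by element, bond value, and number of attachments.

    `atoms` uses a hypothetical layout: {atom_no: (element, (bond values...))}.
    """
    subsets = defaultdict(set)
    for atom_no, (element, bonds) in atoms.items():
        subsets[("element", element)].add(atom_no)
        subsets[("attachments", len(bonds))].add(atom_no)
        for bond in set(bonds):
            subsets[("bond", bond)].add(atom_no)
    return dict(subsets)


# Hypothetical three-atom example, C-C=O, with bonds written S (single) and D (double).
example = {1: ("C", ("S",)), 2: ("C", ("S", "D")), 3: ("O", ("D",))}
subsets = initial_partition(example)
```

The same function would presumably be applied to both the substructure and the structure, so that an atom of the substructure need only be compared with the structure atoms that fall in the correspondingly keyed subsets.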
These subsets would then be further partitioned by intersecting pairs of subsets, e.g., all carbons with single bonds attached would be in one subset, all carbons with double bonds attached would be in another subset, etc. Additional subsets would then be generated using the connections of each atom, and further partitioning would be attempted. These processes for partitioning and generating sets lead to one of the following situations: (1) a complete correspondence between each atom in the substructure and the structure, in which case the substructure is contained within the structure; or (2) a non-correspondence between each atom of the substructure and the structure, in which case the substructure is not contained within the structure; or (3) a situation in which no direct correspondence can be found, because either the properties used to partition the atoms were not powerful enough to distinguish between each atom or there is more than one correspondence between the substructure and structure. In the third case, the various alternatives for the correspondence between substructure and structure must be tried, thus leading to the correspondence or a contradiction.

Both of these approaches to substructure searching of a connection table are extremely time-consuming, and it is usually necessary for economic reasons to use some form of screening system. In fact, it may be necessary to develop some form of a screening system for large files, regardless of the representation system.

Figure 5. Representation using connection table
Figure 6. Representation using compact connection table
Figure 7. Coordinate representation of structure diagram
Figure 8. Bond-centered fragments: simple pairs; augmented pairs (where the connectivities of the atoms are included); bonded pairs (where the bond values for attachments are included)

Screening is the first stage of a substructure search and is intended to eliminate inexpensively a large number of structures which do not meet the requirements of a particular substructure search question. Screens are characteristics which can be identified in the substances in a file; they are similar to fragment codes but usually consist of computer-generated data of structural significance (elements, bonds, counts, small substructural units) rather than the nomenclature and function data used in fragment code systems. After the screens are generated for a particular substructure, the screen search is carried out to select all structures which contain the characteristics necessary for that substructure, thus minimizing the number of compounds requiring a detailed search.

In the selection of a screening system, the determination of the set of structural characteristics to act as the screens is a major problem. A proper balance must be established between the cost of generating, storing, and searching the screens and ensuring that the searches at the screen level achieve complete recall. In addition, the structural characteristics selected as screens should occur with a distribution as even as possible. Because of the uneven distribution of structural characteristics, this represents a significant problem.
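The two-stage search described above, an inexpensive screen followed by a detailed search of the survivors, might be sketched as follows. This is a minimal illustration under stated assumptions: the screen bit assignments, the graph layout (atom mapped to an element and a dictionary of bonded neighbours), and the names `passes_screen`, `contains`, and `search` are all hypothetical, and the detailed stage is a generic backtracking matcher standing in for the iterative atom-by-atom search (it does not start from the least common atom).

```python
def passes_screen(structure_bits, query_bits):
    """True if the structure has every screen bit that the query requires."""
    return (structure_bits & query_bits) == query_bits


def contains(structure, query):
    """Backtracking atom-by-atom test that `query` occurs within `structure`.

    Each graph is {atom: (element, {neighbour: bond value})}.
    """
    order = list(query)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        q_atom = order[len(mapping)]
        q_element, q_bonds = query[q_atom]
        for s_atom, (s_element, s_bonds) in structure.items():
            if s_element != q_element or s_atom in mapping.values():
                continue
            # Every query neighbour already mapped must be bonded the same way.
            if all(s_bonds.get(mapping[n]) == bond
                   for n, bond in q_bonds.items() if n in mapping):
                mapping[q_atom] = s_atom
                if extend(mapping):
                    return True
                del mapping[q_atom]  # non-match: back up and try another atom
        return False

    return extend({})


def search(file, query, query_bits):
    """Screen first, then run the detailed search only on the survivors."""
    return [name for name, (bits, graph) in file.items()
            if passes_screen(bits, query_bits) and contains(graph, query)]


# Hypothetical screen bits: presence of carbon, oxygen, nitrogen.
C_BIT, O_BIT, N_BIT = 1, 2, 4
ethanol = {1: ("C", {2: "S"}), 2: ("C", {1: "S", 3: "S"}), 3: ("O", {2: "S"})}
ethylamine = {1: ("C", {2: "S"}), 2: ("C", {1: "S", 3: "S"}), 3: ("N", {2: "S"})}
acetaldehyde = {1: ("C", {2: "S"}), 2: ("C", {1: "S", 3: "D"}), 3: ("O", {2: "D"})}
file = {
    "ethanol": (C_BIT | O_BIT, ethanol),
    "ethylamine": (C_BIT | N_BIT, ethylamine),
    "acetaldehyde": (C_BIT | O_BIT, acetaldehyde),
}
query = {"a": ("C", {"b": "S"}), "b": ("O", {"a": "S"})}  # C-O single bond
hits = search(file, query, C_BIT | O_BIT)
```

In this toy file the screen removes ethylamine without any atom-by-atom work, while acetaldehyde passes the screen and is rejected only by the detailed search, which is why the screen must never reject a true answer but may pass false candidates.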
Adamson and others [22] account for disparate frequencies of characteristics in chemical structures by employing screens at different levels of detail. The screens for frequent characteristics are generated at a substantial level of detail, whereas less common characteristics are carried in more general terms. For this approach, the set of screens is chosen on the basis of the attributes and the size of the file. The screens thus selected are based on bond-centered fragments with three different levels of detail, as illustrated in Figure 8. The most commonly occurring pairs of atoms in the file are included as screens among the simple pairs. For a sample file of 30,000 structures from the CAS Chemical Registry System, 18 simple pairs were included. The most frequently occurring simple pairs were included as augmented pair screens. For the particular file studied, the augmented pairs all involved carbon attached to carbon (CC), carbon attached to nitrogen (CN), or carbon attached to oxygen (CO). The most frequently occurring augmented pairs were included as bonded pair screens; again these involved only CC, CN, or CO.

The total set of screens consisted of (1) the number of common structural features, e.g., the number of carbon atoms, the number of atoms with connectivity equal to 3, or the number of double chain bonds; (2) bits to indicate the presence or absence of various atoms; (3) bits to indicate the presence or absence of the 18 most common simple pairs of atoms; (4) bits to indicate the presence or absence of the augmented pairs; (5) bits to indicate the presence or absence of bonded pairs; and (6) bits to indicate the presence or absence of various ring systems. A description of the algorithm that generates these screens is provided in a later section.

To achieve an even distribution of screens and a wider variation in fragment selection, Feldman and Hodes [23] have developed a screen generation procedure for use in the chemical structure search system at the Walter Reed Army Institute of Research. The screens selected are based on frequency statistics from a sample of the total base. The process involves "growing" fragments for each structure from a subset of their file by starting with each atom and then adding single atoms at each iteration to the fragments generated during the previous iteration. Unchecked, this process would generate all possible fragments. To keep the number of fragments reasonable, an elimination rule based on the frequency of occurrence of each fragment within the sample file is applied. This rule determines which fragments are to be eliminated (those which occur at a frequency of less than 0.1%) and which fragments are to be passed on to the next iteration (those which occur at a frequency of greater than 1%), where they will "grow" further. In addition, a heuristic procedure based on earlier operational experience was used to "prune" a large number of fragments which were chemically insignificant. The fragments obtained at the completion of this iterative process were then used as screens.

Interconversion.
With the variety of representations, the approach taken in selection of a chemical substance representation has not been to select one representation to handle a full range of functions but rather, through automatic interconversion, to utilize the representation which best solves a particular problem or meets a particular set of processing requirements for a given information system. In addition to providing this internal flexibility, automatic interconversion permits interchange of information among systems using various structure representations. Granito [24] discusses the needs and status of interconversions among chemical substance information systems. Campey, Hyde, and Jackson [25] illustrate a chemical structure information system which uses a variety of representations.

Substantial attention has been given to, and progress made in, the development of procedures to effect conversion between chemical substance representations. Zamora and Davis [26] describe an algorithm to convert a coordinate representation of a chemical substance (derived from input by a chemical typewriter) to a connection table. An approach for interactive input of a structure diagram and conversion of this representation to a connection table suitable for substructure searching is discussed by Feldmann [27]. The conversion of systematic nomenclature to connection tables offers a powerful editing tool as well as a potential mechanism for conversion of name files to connection tables; this type of conversion is described by Vander Stouw [28].
Programs now exist to convert Wiswesser Line Notation [29], Hayward [30], and IUPAC [18] linear notations to connection tables. Because fragment codes alone do not provide the complete description of all structural detail, conversion to other representations is typically not possible.

The conversion from a connection table to other unambiguous representations is substantially more difficult. The connection table is the least structured representation and incorporates no concepts of chemical significance beyond the list of atoms, bonds, and connections. A complex set of rules must be applied in order to derive nomenclature and linear notation representations. To translate from these more structured representations to a connection table requires primarily the interpretation of symbols and syntax. The opposite conversion, from the connection table to linear notation, nomenclature, or coordinate representation, first requires the detailed analysis of the connection table to identify appropriate substructural units. The complex ordering rules of the nomenclature or notation system, or the esthetic rules for graphic display, are then applied to derive the desired representation.

Ebe and Zamora [31], building on algorithms that generate Wiswesser Line Notation for ring systems from a connection table [32], have developed procedures to employ these interconversions for editing Wiswesser Line Notations for complex ring systems.
Farrell, Chauvenet, and Koniver [33] describe procedures for generating Wiswesser Line Notation from connection tables, and Lefkovitz [9] describes the derivation of Mechanical Chemical Code, a concise form of a connection table, from the CAS connection table. Programs have also been developed to derive a DARC code from both connection tables and linear notations. Algorithms for generation of systematic nomenclature from a connection table are currently being developed by CAS.

Because the structure diagram is a desirable form of output from an automated chemical structure information handling system, several algorithms have been developed to generate a coordinate representation from a connection table [34 and 35]. However, most structure display systems were developed for a chemical typewriter or line printer, and the physical characteristics of these devices restrict the complexity of structures to be displayed. An algorithm for a general Cartesian coordinate system, which produces structure diagrams of high graphical quality from a connection table representation, has been developed and utilized at CAS and is described by Dittmar and Mockus [36]. In a later section, an example is provided to illustrate features of this algorithm.

Related Continuing Developments

A variety of algorithms for the computer handling of chemical structure information have been described. The techniques for
representation and processing have become established, and, as indicated by the existence of effective operational systems [4, Chapters 8 and 9] and some algorithms presented earlier, practical solutions exist for many of the problems in the handling of chemical structures.

Several of the general graph theory problems are presently unsolved. An example is subgraph isomorphism: given two graphs, G1 and G2, is G1 isomorphic to a subgraph of G2? It is conjectured that no algorithm for solving it in polynomial time exists, i.e., all known algorithms have at least an exponential growth rate based on the number of vertices for some subset of graphs. Another example is general graph isomorphism: given two graphs, G1 and G2, is G1 isomorphic to G2? This problem is also unsolved and is a special case of the subgraph isomorphism problem [37]. For various classes of graphs, as in the case of planar graphs [38], isomorphism algorithms have been found. Sanders [39] demonstrates that the algorithmic generation of Wiswesser Line Notation is not polynomial bounded. As illustrated earlier, good heuristic procedures have been established to provide solutions to isomorphism problems for the graphs corresponding to chemical structures. However, the general graph theory problems remain and are receiving continued attention.

Algorithms that process structural data of chemical substances are being developed for many areas. For example, structure/property correlation [40] utilizes a chemical substance data base to provide a correlation between biological properties and structural features of chemical substances.
Reactants and products of chemical reactions can be analyzed to provide retrieval of information about partial structures that characterize the reaction [41]. Among the computer programs that have been developed for utilizing chemical structure information are molecular modeling programs [42], aimed at using the computer to generate actual three-dimensional descriptions of chemical substances, and organic synthesis programs [43], which predict by computer the design of possible synthetic routes to a given target substance.

APPENDIX

Sample Algorithms

Illustrative sample algorithms that support system processes in areas of registration, substructure searching, and automatic interconversion are provided below.

Algorithm I - Registration - Canonicalization of Connection Tables. A connection table for a chemical substance with n atoms can be numbered in as many as n! different ways. The problem of generating a canonical form involves selecting a
single and invariant numbering of the connection table. An approach would be to generate all n! representations, sort them alphabetically, and then select the one which compares lowest. Except for very small n, this procedure is obviously not feasible. The approach presented below is a variation of this procedure, and limits the number of representations that must be generated by establishing a partial order of atoms, restricting the numbering permitted, and saving the results of the path tracing.

Given an arbitrarily numbered connection table representation of a structure with n non-hydrogen atoms, the unique numbering is obtained as follows:

1. Assign Stage 1 connectivity values to each atom based on the number of attachments to the atoms.
2. Assign Stage 2 connectivity values to each atom by summing the Stage 1 connectivity values for the attached atoms.
3. Given the Stage i connectivity values for each atom, assign the Stage i+1 connectivity values by summing the Stage i connectivity values for the attached atoms.
4. Calculate the number of distinct connectivity values at Stage i and Stage i+1.
5. If the number of distinct connectivity values at Stage i+1 is greater than at Stage i, go to Step 3.
6. Otherwise, the final connectivity values are the Stage i values.
7. Select the atom with the highest connectivity value and designate that atom as Number 1.
8. Since Steps 1-6 provide only a partial order of the atoms, note all other atoms with the same connectivity value.
9. Atoms connected to Atom 1 are assigned 2, 3, etc. based on decreasing connectivity values. If a choice is arbitrary (where the atoms have the same connectivity value), note the pairs of atoms involved in the arbitrary choice.
10. The unnumbered atoms attached to Atom 2 are numbered based on decreasing connectivity values. Again, note pairs of atoms where the choice was arbitrary.
11. This procedure is followed until all atoms have been numbered.
12. Build and retain the compact connection table based on this numbering.
13. Back up to the highest numbered atom for which the choice was arbitrary.
14. If there are no remaining atoms where the choice was arbitrary, the process is complete and the retained connection table is the unique representation.
15. Select the other atom from the pair involved in the arbitrary choice and renumber the atoms of the structure from that atom to the last atom.
16. Build a new compact connection table.
17. Compare the newly generated compact connection table to the retained compact connection table.
18. If the new connection table is alphabetically less than the retained table, replace the retained table with the new table, and go to Step 13. Otherwise, go to Step 13.
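Steps 1 through 6 above can be sketched as a short loop. This is a minimal illustration under stated assumptions rather than the CAS implementation: the function name `final_connectivity` and the adjacency-dictionary layout are hypothetical, and the eight-atom example is an assumed reading of the structure in Figure 9 (O=C-C-C with one C-O branch and one C-C branch), with arbitrary atom labels.

```python
def final_connectivity(adjacency):
    """Steps 1-6: sum neighbour values until the number of distinct values stops growing."""
    # Step 1: Stage 1 value is the number of attachments.
    values = {atom: len(neighbours) for atom, neighbours in adjacency.items()}
    while True:
        # Steps 2-3: the next stage sums the current values of the attached atoms.
        next_values = {atom: sum(values[n] for n in neighbours)
                       for atom, neighbours in adjacency.items()}
        # Steps 4-6: continue only while the count of distinct values increases.
        if len(set(next_values.values())) > len(set(values.values())):
            values = next_values
        else:
            return values


# Assumed reading of the Figure 9 structure; labels are arbitrary.
adjacency = {
    "O1": ["C2"],
    "C2": ["O1", "C3"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "C5", "C7"],
    "C5": ["C4", "O6"],
    "O6": ["C5"],
    "C7": ["C4", "C8"],
    "C8": ["C7"],
}
values = final_connectivity(adjacency)
start_atom = max(values, key=values.get)  # Step 7: highest value becomes Number 1
```

For this example the loop stops with the Stage 3 values, and the two branch atoms `C5` and `C7` share the same value, which is the kind of tie that Steps 8 through 18 resolve by trying the alternative numberings.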
Figure 9 illustrates the steps in the algorithm for generating the unique connection table. Figure 9a illustrates the Stage 1 connectivity values, which are the numbers of attachments, and Figure 9b illustrates the Stage 2 connectivity values, which are obtained by summing the Stage 1 connectivity values for the attached atoms. At Stage 2, the number of distinct values is 5. The Stage 3 connectivity values are obtained by summing the Stage 2 connectivity values for the attached atoms, as illustrated in Figure 9c. Since in Stage 3 the number of distinct values is 6, which is greater than the Stage 2 value of 5, the iterative process is continued. Figure 9d illustrates the Stage 4 connectivity value calculations. Since the number of distinct values at Stage 4 is equal to that at Stage 3, the final connectivity values assigned are those calculated at Stage 3.

Figure 9e illustrates the initial numbering and the compact connection table using the Stage 3 connectivity values. The atom with connectivity value of 13 is assigned Number 1. The atoms attached to Atom 1 are numbered 2, 3, and 4 based on decreasing connectivity values. The arbitrary choice between 3 and 4 is noted. The unnumbered attachment to Number 2 is assigned Number 5. The unnumbered attachments to Atoms 3, 4, and 5 are numbered. Based on this numbering, the initial connection table is constructed and retained, as shown in Figure 9e.

Backing up to the highest atom marked as an arbitrary choice, Atom 3, the other alternative is tried and the representation is renumbered from Atom 3. Figure 9f illustrates the numbering and compact connection table resulting from this alternative. The connection table generated is alphabetically compared to the retained connection table. The attachment lists of the retained and newly generated connection tables are compared and they are equal. The atom list of the newly generated connection table is compared and is lower than that of the retained connection table, because C in Position 6 of the newly generated table is lower than O in the retained connection table. Therefore, the newly generated connection table is retained. Since there are no other atoms noted as involving an arbitrary choice, the retained table is the single invariant representation which is selected as the representation for this substance.

Figure 9. Generation of a unique, unambiguous connection table
a.) Stage 1 connectivity values = {1, 2, 3}. No. of distinct values = 3.
b.) Stage 2 connectivity values = {2, 3, 4, 5, 6}. No. of distinct values = 5.
c.) Stage 3 connectivity values = {3, 4, 7, 8, 9, 13}. No. of distinct values = 6.
d.) Stage 4 connectivity values = {7, 8, 12, 17, 20, 25}. No. of distinct values = 6.
e.) Initially numbered connection table. Arbitrary choice between 3 and 4.
    Atom No.      1  2  3  4  5  6  7  8
    Attachments   -  1  1  1  2  3  4  5
    Elements      C  C  C  C  C  O  C  O
    Bonds         -  S  S  S  S  S  S  D
f.) Alternately numbered connection table.
    Atom No.      1  2  3  4  5  6  7  8
    Attachments   -  1  1  1  2  3  4  5
    Elements      C  C  C  C  C  C  O  O
    Bonds         -  S  S  S  S  S  S  D

The number of alternate numberings which must be attempted is dependent on the number of atoms which have attachments with equal connectivity values. All of these various alternate numbering combinations must be attempted. Consequently, the algorithm does not provide a practical solution to the general graph isomorphism problem. However, because the graphs corresponding to chemical structures typically have connectivities of 1, 2, 3, or 4, the algorithm does provide a practical way to uniquely label virtually all graphs corresponding to a chemical structure.

This algorithm is implemented on an IBM 370/168.
As part of routine production at CAS, 13,000 substances per week are uniquely numbered through this algorithm at an average processing rate of 1000 structures per minute of CPU time. Since there are some highly symmetrical structures which would require a substantial number of iterations, the algorithm is implemented to stop after three CPU seconds and use a registration approach based on a non-unique representation. For the 677,000 substances processed in 1975, 990 substances could not be uniquely labeled within the three seconds. Ferrocene, shown in Figure 10, is an example of a structure which would require 10! or 3,628,800 iterations. For substances of this type which cannot be uniquely labeled within the three CPU second time limit, an isomer-sort registration technique is utilized to complete the registration processes without human intervention.
Figure 10. Ferrocene
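The production figures quoted above can be put in proportion with a few lines of arithmetic. The variable names are illustrative only; the input numbers are those given in the text.

```python
import math

# Figures quoted in the text.
substances_1975 = 677_000
not_labeled = 990
weekly_substances = 13_000
rate_per_cpu_minute = 1_000

ferrocene_iterations = math.factorial(10)               # 10! alternate numberings
unlabeled_percent = 100 * not_labeled / substances_1975
weekly_cpu_minutes = weekly_substances / rate_per_cpu_minute

print(ferrocene_iterations)          # 3628800
print(round(unlabeled_percent, 2))   # 0.15
print(weekly_cpu_minutes)            # 13.0
```

On these numbers, roughly 0.15% of the substances processed in 1975 fell through to the isomer-sort technique, and the weekly load corresponds to about 13 minutes of CPU time.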
Algorithm II - Substructure Search - Screen Generation. In an earlier section, bond-centered screens for substructure search are described. Below is an algorithm for generating these screens. Given the connection table representation of a chemical substance, the algorithm for the generation of the bond-centered screens consists of the following steps:

1. Construct the set of counts of atoms, bonds, and connections and set the appropriate atom and ring system bits.
2. Select the first/next pair of atoms.
3. If the pair is CC, CN, or CO, determine if it is one of the bonded pairs. If not, go to Step 8.
4. If it is one of the bonded pairs, go to Step 10.
5. If it is not one of the bonded pairs, determine if it is one of the augmented atom pairs.
6. If it is one of the augmented atom pairs, go to Step 11.
7. If it is not one of the augmented atom pairs, go to Step 12.
8. If it is one of the simple pairs, go to Step 12.
9. If it is not one of the simple pairs, set exception pair bits and go to Step 13.
10. Set appropriate bonded pair bits.
11. Set appropriate augmented pair bits.
12. Set appropriate simple pair bits.
13. If this is not the last pair, go to Step 2.
14. If this is the last pair, the process is complete.
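The control flow of the pair-processing steps can be sketched as follows. This is an illustration under stated assumptions, not the CAS implementation: the lookup tables, the `AtomPair` fields, and the bit numbers are hypothetical, Step 1 (counts, atom bits, ring system bits) is omitted, and Steps 10 to 12 are read literally as running in sequence once entered, which the text does not state explicitly.

```python
from typing import NamedTuple


class AtomPair(NamedTuple):
    """Hypothetical description of one atom pair taken from a connection table."""
    elements: str       # e.g. "CC", "CN", "CO", "NS"
    bonded_key: str     # key into the bonded-pair table
    augmented_key: str  # key into the augmented-atom-pair table
    simple_key: str     # key into the simple-pair table


def generate_pair_bits(pairs, bonded, augmented, simple, exception_bit):
    """Return the set of screen bit numbers turned on for the given pairs."""
    bits = set()
    for p in pairs:                                    # Steps 2, 13, 14
        if p.elements in ("CC", "CN", "CO"):           # Step 3
            if p.bonded_key in bonded:                 # Step 4
                entry = 10
            elif p.augmented_key in augmented:         # Steps 5, 6
                entry = 11
            else:                                      # Step 7
                entry = 12
        elif p.simple_key in simple:                   # Step 8
            entry = 12
        else:                                          # Step 9
            bits.add(exception_bit)
            continue
        # Assumption: Steps 10, 11, 12 execute in order from the entry point.
        if entry <= 10:
            bits.add(bonded[p.bonded_key])             # Step 10
        if entry <= 11 and p.augmented_key in augmented:
            bits.add(augmented[p.augmented_key])       # Step 11
        if p.simple_key in simple:
            bits.add(simple[p.simple_key])             # Step 12
    return bits


# Hypothetical tables and pairs for illustration only.
bonded = {"C-C ring": 1}
augmented = {"C(C)(C)-N": 2}
simple = {"CC": 3, "CN": 4, "NS": 5}
pairs = [
    AtomPair("CC", "C-C ring", "", "CC"),
    AtomPair("CN", "C-N chain", "C(C)(C)-N", "CN"),
    AtomPair("NS", "", "", "NS"),
    AtomPair("PS", "", "", "PS"),
]
screen_bits = generate_pair_bits(pairs, bonded, augmented, simple, exception_bit=99)
```

If the more specific pair types are not meant to set the more general bits as well, the three trailing `if` statements would be replaced by a single branch on `entry`.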
Algorithm III - Interconversion - Connection Table to Structure Diagram. This algorithm has as input the connection table representation of a chemical substance and an authority file containing a coordinate representation of all unique ring system shapes for all ring systems; an example of input for one chemical substance is shown in Figure 11a. The manually built file of coordinate representations for the ring system shapes eliminates many of the problems associated with assigning coordinates to ring systems. This file at CAS contains 15,000 ring system shapes which represent the ring shapes for virtually all ring systems occurring in the 3.5 × 10^6 distinct substances in the CAS Chemical Registry System. The examples below illustrate features of this algorithm.

The algorithm partitions the connection table into three groups: ring systems, the largest connected substructural units in which all edges are in a cycle; chains, linear acyclic strings with one terminal atom; and links, linear acyclic strings without any terminal atoms. The algorithm substitutes commonly recognized shortcut symbols for various groups of atoms, e.g., Me for the methyl group and Ph for the benzene ring. Figure 11b illustrates these processes.

The most central ring system is identified, its pre-stored ring shape is retrieved, and the nodes and the bonds of the ring system are mapped into the ring shape. The atom characters and bond vectors are calculated based on the coordinates of the ring shape, shown in Figure 11c. If there are no ring systems, the most central acyclic atom is used as the starting point.

With the most central ring system as the base structure, the direction, bond angle, and bond length are determined, first for the attached links and then for the chains attached to the ring systems. For links, the direction is away from the base structure, and is horizontal or vertical based on the angle nearest to the bisecting angle of the ring perimeter. For chains, the direction is away from the base structure and bisects the ring perimeter angle. A standard length bond is used.
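The partitioning of a connection table into ring systems, chains, and links, as defined above, can be sketched on a plain adjacency mapping. This is a simplified illustration with hypothetical function names: a bond is treated as a ring bond when its two atoms remain connected after the bond is removed, and each connected acyclic fragment is classed as a chain if it contains a terminal atom and as a link otherwise. Splitting branched acyclic fragments into separate linear strings is not attempted here.

```python
from collections import deque


def _reachable(adj, start, goal, skip_edge):
    """Breadth-first search from start to goal that never uses skip_edge."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            return True
        for v in adj[u]:
            if {u, v} == skip_edge or v in seen:
                continue
            seen.add(v)
            queue.append(v)
    return False


def _components(nodes, neighbors):
    """Connected components of `nodes` under the neighbors(u) relation."""
    seen, comps = set(), []
    for n in sorted(nodes):
        if n in seen:
            continue
        comp, stack = set(), [n]
        seen.add(n)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in neighbors(u):
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def partition(adj):
    """Split atoms into (ring_systems, chains, links), each a list of atom sets."""
    ring_bonds = {frozenset((u, v)) for u in adj for v in adj[u]
                  if _reachable(adj, u, v, {u, v})}
    ring_atoms = {a for bond in ring_bonds for a in bond}
    ring_systems = _components(
        ring_atoms, lambda u: [v for v in adj[u] if frozenset((u, v)) in ring_bonds])
    chains, links = [], []
    for frag in _components(set(adj) - ring_atoms, lambda u: adj[u]):
        has_terminal = any(len(adj[a]) == 1 for a in frag)
        (chains if has_terminal else links).append(frag)
    return ring_systems, chains, links


# Illustrative graph: six-membered ring 1-6, two-atom chain 7-8 on atom 1,
# single-atom link 9 joining atom 4 to a three-membered ring 10-12.
adj = {1: {2, 6, 7}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5, 9}, 5: {4, 6}, 6: {5, 1},
       7: {1, 8}, 8: {7}, 9: {4, 10}, 10: {9, 11, 12}, 11: {10, 12}, 12: {10, 11}}
ring_systems, chains, links = partition(adj)
```

The per-bond reachability test is quadratic in the worst case; it is chosen here for brevity, and a bridge-finding depth-first search would serve the same purpose more efficiently.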
Figure 11d illustrates these processes.

For links and chains attached to the base structure, the coordinates of the atoms and bonds of the component are determined. The coordinates of the first atom attached to the ring system are determined. Coordinates for the next atom are above, below, to the right, or to the left, and they are determined based on the drawing direction. Horizontal single bonds are drawn implicitly; all other bonds are drawn explicitly. All atoms in the link or chain are placed similarly. When the coordinates of all links and chains are determined, the link
Figure 11. Generation of a coordinate representation from a connection table.
a) Connection Table for Substance and Coordinate Representation for Ring Shapes.
b) Partitioning of Atoms into Ring Systems, Links, and Chains, and Substitution of Shortcut Symbols.
c) Identification and Placement of Most Central Ring System.
d) Determination of Bond Direction, Angle, and Length for Chains and Links.
e) Placement of Links and Chains.
f) Identification and Placement of Second Ring System.
g) Placement of Chains.
h) Results of Display Procedure.
or chain is positioned relative to the ring system, as illustrated in Figure 11e. All other links and the chains attached to the most central ring system are positioned in a similar manner.

When all links and chains attached to the most central ring system are placed, the next ring system and its ring shape are retrieved. (Note that in this example there are no links or chains attached to the attached links and chains.) The atoms and bonds are mapped into the ring shape, and the atom characters and bond vectors are calculated from the coordinates of the ring system, as illustrated by Figure 11f. The orientation of each ring system after the first must reflect how it is attached to the base structure. In order to allow for attaching it to the base structure, it may be necessary to reflect the ring system about the x-axis, the y-axis, or both.

If a second ring system with attachments is present, the direction, bond angle, and bond length for chains and links attached to the second ring system are then determined, as shown in Figure 11g. Following this, the coordinates of links and chains attached to the second ring system are attached. If attachments are present on the links and chains attached to the
Figure 12. Example of photocomposer output

Figure 13. Example of electrostatic printer output
second ring system, they would be positioned at this point. The second ring system with its attachments is then attached to the base structure. Since all components of the substance have been processed, the display is complete; that is, a coordinate representation has been derived. The results of this process are illustrated in Figure 11h.

Throughout this process, as each component is added to the base structure, it is tested for overlap. If overlap is detected, it is resolved by extending the bond length and/or adjusting the bond angle. Since this algorithm uses the coordinate representation described earlier, movement of each component to be added requires updating the coordinates of the node associated with that component rather than the coordinates of each atom involved.

This algorithm produces highly acceptable results. With initial implementation, considerations for handling special cases of substances, e.g., coordination compounds, polymers, and incompletely defined structures, were deferred. The algorithm will generate images for many of these structures but acceptability is dependent on use. It is estimated that the current version of the algorithm will generate a highly acceptable (by CAS internal drawing standards) coordinate representation for 85% of the 3.5 × 10^6 unique substances in the CAS Chemical Registry System. The algorithm requires 266K bytes of main storage for executable instructions and processes 8 substances per CPU second on an IBM 370/168.

Within the CAS Composition Facility, the device-independent coordinate representation generated by this algorithm can be converted to the device-specific coordinates of the Autologic APS-4 photocomposer for high graphical quality output (illustrated by Figure 12) or to the Varian Status 21 electrostatic printer for low cost worksheet production (illustrated by Figure 13).
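The overlap handling described for the structure diagram algorithm, in which a component is moved by updating only its node, can be illustrated with a small sketch. The `Component` class, the distance threshold, and the step size are all hypothetical; the sketch assumes each component stores one node position plus atom positions relative to that node, and it resolves overlap only by extending the bond length, leaving bond angle adjustment aside.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical drawing component: one node position plus atom offsets."""
    node: tuple    # (x, y) of the node associated with the component
    offsets: list  # atom positions relative to the node

    def atoms(self):
        nx, ny = self.node
        return [(nx + dx, ny + dy) for dx, dy in self.offsets]


def overlaps(a, b, min_dist=1.0):
    """True if any atom of a lies closer than min_dist to any atom of b."""
    return any(math.dist(p, q) < min_dist for p in a.atoms() for q in b.atoms())


def attach(base, comp, anchor, angle_deg, bond_len=1.0, step=0.5, max_tries=10):
    """Place comp along angle_deg from anchor, lengthening the bond until
    comp no longer overlaps any component already in base."""
    ux = math.cos(math.radians(angle_deg))
    uy = math.sin(math.radians(angle_deg))
    for i in range(max_tries):
        length = bond_len + i * step
        # Only the node is updated; the atom offsets are never touched.
        comp.node = (anchor[0] + length * ux, anchor[1] + length * uy)
        if not any(overlaps(comp, other) for other in base):
            return length
    raise ValueError("overlap not resolved by bond extension alone")


# Illustrative use: a horizontal three-atom base and a two-atom component.
base = [Component((0.0, 0.0), [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)])]
comp = Component((0.0, 0.0), [(0.0, 0.0), (0.0, 1.0)])
final_length = attach(base, comp, anchor=(1.0, 0.0), angle_deg=0.0)
```

In this example the standard bond length places the component on top of an existing atom, so the bond is extended twice before the component clears the base structure.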