
ANALYTICAL BIOCHEMISTRY 235, 1–10 (1996)
ARTICLE NO. 0084

REVIEW

Methods to Estimate the Conformation of Proteins and Polypeptides from Circular Dichroism Data

Norma J. Greenfield
Department of Neuroscience and Cell Biology, UMDNJ-Robert Wood Johnson Medical School, 675 Hoes Lane, Piscataway, New Jersey 08854-5635

Received August 23, 1995

Circular dichroism (CD) is an excellent method for analyzing the conformation of proteins and peptides in solution. This review compares various methods of obtaining structural information from CD data and the advantages and pitfalls of each technique are detailed. Among the topics discussed are how does the wavelength range of data acquisition affect the precision of the determination of protein conformation, how precisely must the protein concentration be determined for each method to give reliable answers, and what computer resources are necessary to use each method.
© 1996 Academic Press, Inc.

Circular dichroism (CD)¹ spectroscopy is a valuable technique for analyzing the secondary structure of proteins in solution. This article is designed to acquaint nonexperts with some of the modern methods used to extract structural information from CD spectra. The body of the review discusses (i) what contributes to the CD spectrum of a protein or polypeptide; (ii) how samples must be prepared for CD analysis; (iii) what the commonly used methods for extracting secondary structural information from CD data are, and what computer resources are needed to use each method; (iv) how precisely the protein concentration must be determined for each method to give reliable answers; and (v) how the wavelength range affects the precision of the answers obtained using each method. For further reading, there are many excellent review articles in the literature on the theory and use of CD (1–6).

¹ Abbreviations used: CD, circular dichroism; P2, poly-L-proline II; SVD, singular value decomposition; CCA, convex constraint analysis.

THE ORIGIN OF CIRCULAR DICHROIC ACTIVITY OF PROTEINS

Circular dichroism is a phenomenon that results when chromophores in an asymmetrical environment interact with polarized light. In proteins the major optically active groups are the amide bonds of the peptide backbone and the aromatic side chains. Polypeptides and proteins have regions where the peptide chromophores are in highly ordered arrays, such as α-helices or β-pleated sheets. Depending on the orientation of the peptide bonds in the arrays, the optical transitions of the amide bond can be split into multiple transitions, the wavelengths of the transitions can be increased or decreased, and the intensity of the transitions can be enhanced or decreased. As a consequence, many common secondary structure motifs, such as the α-helix, β-pleated sheets, β-turns, and poly-L-proline II (P2), have very characteristic CD spectra. The left-handed helical P2 conformation is found in collagen and has recently been identified in short segments of some globular proteins (7, 8). Spectra of representative polypeptides with these conformations are shown in Fig. 1.
SAMPLE PREPARATION FOR CD MEASUREMENTS

For meaningful CD analyses, samples must be free of contaminating proteins, which might contribute to the final spectrum, and of other optically active impurities, such as nucleotides or optically active buffers (e.g., glutamate). CD measurements should be made on samples with a maximum absorbance (including buffer) of 1.0 in the wavelength region of interest. Amide bonds have optical transitions in the ultraviolet region below 250 nm. One can obtain useful estimates of protein conformation from data obtained only between 240 and 200 nm (see below), which means that proteins may be examined in physiological buffers (i.e., phosphate-buffered saline with 1 to 2 mM EDTA and/or 1 to 2 mM dithiothreitol) at concentrations of approximately 0.1 to 0.4 mg/ml in cuvettes with a pathlength of 1 mm. Low concentrations of organic buffers, e.g., 2 mM Hepes, are also permissible. The information content of CD spectra, however, and therefore the precision of the structural estimates, increases when the lower limit of the wavelength range is extended into the far ultraviolet. For the greatest precision it is recommended to collect data between 260 and 178 nm or even 168 nm (6, 9). For measurements below 195 nm it is necessary to use very transparent buffers, such as 10 mM potassium phosphate, and cuvettes with very short pathlengths (0.05 to 0.1 mm). Adler et al. (1) and Johnson (6) discuss sample preparation and circular dichroism instrumentation in full detail.

Circular dichroism is a quantitative spectroscopic technique. The various secondary structures have ellipticity bands with both characteristic wavelengths and magnitudes. Therefore, with the exception of nonconstrained least-squares analysis (see below), all of the methods of CD analysis require a precise knowledge of protein concentration. For example, the method of Bradford (10) is not acceptable because the results are dependent on the aromatic content of the protein, and detergents can interfere with the analyses. The aromatic absorption spectrum of a protein depends on its conformation, so measurements of the absorbance of native proteins at 280 nm may only be used to determine their concentrations when the extinction coefficients have been determined precisely. In addition, oxidized dithiothreitol and 2-mercaptoethanol and light scattering all can increase the apparent absorbance of a protein solution at 280 nm. Recommended methods of determining protein concentration include (i) quantitative amino acid analysis and (ii) determination of peptide backbone concentration by the measurement of biuret (11) (note that reducing agents interfere with this assay) or total nitrogen (12). It is also possible to use the aromatic spectrum of the protein to measure its concentration, provided the spectrum is obtained under denaturing conditions (13, 14).

0003-2697/96 $18.00 Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.

FIG. 1. Circular dichroism spectra of polypeptides in the α-helical, β-pleated sheet, β-turn, and P2 conformations. The α-helix, β-sheet, and β-turn spectra are redrawn from Brahms and Brahms (19); the P2 spectrum is that of poly-L-proline in 0.1 M acetic acid.

METHODS TO ANALYZE PROTEIN CONFORMATION

Many methods for extracting protein conformation in solution from CD data have been described in the literature. Basically, all of these methods assume that the spectrum of a protein can be represented by a linear combination of the spectra of its secondary structural elements, plus a noise term which includes the contribution of aromatic chromophores, as given in Eq. [1]:

$$\theta_\lambda = \sum_i F_i S_{\lambda i} + \mathrm{noise} \qquad [1]$$

where θλ is the CD of the protein as a function of wavelength, Fi is the fraction of each secondary structure, i, and Sλi is the ellipticity at each wavelength of each ith secondary structural element. In constrained fits the sum of all the fractional weights, Fi, must be equal to 1.

The major methods of extracting structural information from CD spectra (more or less in historical order) are (i) multilinear regression (15–19); (ii) singular value decomposition (20, 21); (iii) ridge regression (22); (iv) convex constraint analysis (23–25); (v) neural network analysis (26–28); and (vi) the self-consistent method (8, 29, 30). These methods are detailed below. Representative computer programs using these methods, which will run on IBM-compatible computers, are available on a diskette (see the Appendix).

To compare the accuracy of the estimation of protein conformation from CD data, all of the above methods were used to evaluate the α-helical, total β-pleated sheet, and β-turn content of the same set of proteins. This set consisted of 16 proteins plus poly-L-glutamate that were suggested as standards by Sreerama and Woody (8, 30), who assigned their secondary structure from X-ray coordinates using the method of Kabsch and Sander (31). The results are shown in Table 1. The effects of truncating the data between 240 and 200 nm are also shown in Table 1.

Sreerama and Woody (8, 30) analyzed the conformations of the proteins in Table 1 using some of the methods, including singular value decomposition (SVD), ridge regression (the CONTIN program), variable selection (the VARSLC program), a self-consistent method (the SELCON program), and neural networks, alone and in combination. Their results for the individual methods are summarized in Table 1. Table 1 also shows the fits obtained using simple multilinear regression (the G&F, LINCOMB, and MLR programs) with a variety of peptide and protein data bases as references, convex constraint analysis (the CCA program), and a neural net retrieval program (K2D). When the data bases used by these methods were constructed using other methods of analyzing the secondary structure of proteins [i.e., the methods of Levitt and Greer (32) or Hennessey and Johnson (20)], the fits are also compared to the X-ray structures calculated by those methods.

Table 1 lists the correlation coefficients, P, and the mean square errors, s, between the estimated and found contents of each secondary structure. The standard deviations from the average values of each conformation are also shown in Table 1. When the value of s for a particular conformation is not lower than the standard deviation from the mean value, the method does a poor job of predicting that conformation. All of the methods did a good job of predicting α-helix, but they varied greatly in their ability to estimate β-content and turns (see below).

The properties of several computer programs for analyzing CD spectra, which utilize each of the methods, are summarized in Table 2. The computer resources necessary for each analysis (i.e., type of microprocessor required), whether the program provides graphical comparison of the raw data and the fitted curve, the minimum wavelength range required for the analysis, and the relative time required for a single analysis are tabulated.

TABLE 1

Comparisons of Methods of Analyzing Protein Conformation from Circular Dichroism Data

Program  Data base            Assignment (i)  Wavelength  α-Helix      β-Sheet      β-Turn
                                              (nm)        P     s      P     s      P      s

Linear regression, nonconstrained fit
MLR      4 peptides (a)       KS              240–178     0.91  0.13   0.43  0.21    0.07  0.16
MLR      4 peptides (a)       KS              240–200     0.92  0.14   0.74  0.16    0.23  0.16

Linear regression, constrained fit
G&F      Poly-L-lysine (b)    KS              240–208     0.92  0.13   0.61  0.18      ND  ND
LINCOMB  4 peptides (a)       KS              240–178     0.93  0.11   0.58  0.15    0.61  0.11
LINCOMB  4 peptides (a)       KS              240–200     0.94  0.11   0.71  0.13    0.53  0.14
LINCOMB  4 peptides (a)       KS              240–208     0.94  0.12   0.52  0.16    0.57  0.13
LINCOMB  15 proteins (c)      KS              240–190     0.95  0.12   0.83  0.26    0.14  0.19
LINCOMB  15 proteins (c)      LG              240–190     0.89  0.17   0.89  0.15    0.05  0.17
LINCOMB  17 proteins (d,e)    KS              240–178     0.94  0.09   0.62  0.14    0.21  0.13
LINCOMB  17 proteins (d,e)    KS              240–200     0.92  0.10   0.09  0.28    0.52  0.12
LINCOMB  33 proteins (f)      HJ              240–178     0.96  0.08   0.73  0.13    0.38  0.11
LINCOMB  33 proteins (f)      HJ              240–200     0.95  0.08   0.57  0.15    0.11  0.18
LINCOMB  33 proteins (f)      KS              240–200     0.95  0.08   0.56  0.16    0.25  0.17
LINCOMB  23 proteins (g)      KS              240–195     0.90  0.17   0.70  0.12   −0.27  0.22

Singular value decomposition
SVD      17 proteins (k,e)    KS              240–178     0.98  0.05   0.68  0.12    0.22  0.10
SVD      17 proteins (k,e)    KS              240–200     0.97  0.08   0.00  0.27   −0.56  0.27
SVD      17 proteins (k,e,j)  KS              240–200     0.96  0.07   0.43  0.14    0.04  0.13

Convex constraint algorithm
CCA      17 proteins (k)      KS              260–178     0.96  0.10   0.62  0.18    0.39  0.18
CCA      17 proteins (k)      KS              240–200     0.97  0.10   0.42  0.20    0.52  0.22

Ridge regression
CONTIN   17 proteins (k,e)    KS              260–178     0.93  0.11   0.56  0.15    0.58  0.08
CONTIN   17 proteins (k,e)    KS              240–200     0.95  0.13   0.60  0.15    0.74  0.07

Variable selection
VARSLC   17 proteins (k,e)    KS              260–178     0.97  0.07   0.81  0.10    0.60  0.07

Variable selection, self-consistent
SELCON   17 proteins (k,e)    KS              260–178     0.95  0.09   0.84  0.08    0.77  0.05
SELCON   17 proteins (k,e)    KS              260–190     0.94  0.09   0.73  0.09    0.84  0.05
SELCON   17 proteins (k,e)    KS              240–200     0.93  0.10   0.73  0.11    0.71  0.06
SELCON   33 proteins (m,e)    KS              260–178     0.93  0.09   0.91  0.07    0.53  0.09
SELCON   33 proteins (m,e)    KS              240–200     0.88  0.12   0.86  0.09    0.46  0.09
SELCON   33 proteins (m,e)    HJ              260–178     0.93  0.10   0.85  0.09    0.43  0.09
SELCON   33 proteins (m,e)    HJ              240–200     0.88  0.13   0.77  0.11    0.36  0.09

Neural nets
NN       17 proteins (k,e)    KS              260–178     0.93  0.10   0.73  0.11    0.82  0.05
K2D      19 proteins (h,e)    KS              240–200     0.95  0.09   0.77  0.10      ND  ND

Mean conformation ± standard deviation of the 17 test proteins: α-helix, 0.36 ± 0.27; β-sheet, 0.20 ± 0.16; β-turn, 0.22 ± 0.08.

(a) The spectra of the α-helix (sperm whale myoglobin corrected for the contributions of turns and random coil and normalized to 1.0, in 0.1 M NaF, pH 7), β-sheet (poly(lys-leu)n in 0.5 M NaF at pH 7), random coil (poly(pro-lys-leu-lys-leu)n in salt-free solution), and β-turn (poly(ala2,gly)n in water, multiplied by 0.5) (from Brahms and Brahms (19)).
(b) The spectra of poly-L-lysine in the α-helical, β-pleated sheet, and random-coil conformations (15).
(c) Standard curves for α-helix, β-structure, random coil, and β-turn extracted from 15 proteins by multilinear regression by Yang et al. (3).
(d) Standard curves for α-helix, total β-sheet, β-turn, and remainder extracted as described by Yang et al. (3) from the 16 proteins plus poly-L-glutamate utilized as standards by Sreerama and Woody (8, 30).
(e) Each protein analyzed was excluded in turn from the data set.
(f) Standard curves for α-helix, antiparallel β-sheet, parallel β-sheet, β-turn, and random coil extracted as described in footnote d from the spectra of 33 proteins (data supplied by A. Tourmadje and W. C. Johnson Jr.).
(g) Standard curves for α-helix, β-turn and/or parallel β-pleated sheet, aromatic and disulfides, unordered, and antiparallel β-pleated sheet extracted from a data base of 23 proteins by convex constraint analysis by Perczel et al. (25).
(h) Data base of Andrade et al. (27).
(i) The assignments of secondary structure are by the methods of Kabsch and Sander (31) [KS], Levitt and Greer (32) [LG], and Hennessey and Johnson (20) [HJ].
(j) The values for poly-L-glutamate were omitted from the calculations of the correlation coefficient and mean square error because the sum of its conformations was 4, a clearly impossible answer.
(k) Sreerama and Woody (30).
(l) Sreerama and Woody (8).
(m) Data supplied by A. Tourmadje and W. C. Johnson Jr.

TABLE 2

Properties of Computer Programs for Analyzing the Secondary Structure of Proteins and Polypeptides in Solution (a)

Program  Minimum computer  Output fitted  Recommended wavelength  Minimum time of
name     required          curve?         range (nm)              analysis (min) (b)
MLR      Any PC            Yes            240–200 (c)             1
G&F      Any PC            Yes            240–208                 1
LINCOMB  Any PC            Yes            240–200 (c)             1
CONTIN   Any PC            Yes            240–200 (c)             1
VARSLC   386 (d)           No             260–184                 10
SELCON   386 (d)           Yes            260–200 (c)             2
CCA      286               No (e)         240–200 (c)             2 (f)
K2D      286               Yes            240–200                 1

(a) Programs as supplied run on IBM-compatible computers. The programs which output the fitted curves require a CGA-compatible graphics card.
(b) Programs were tested on an IBM-compatible PC with an 80486 microprocessor operating at 33 MHz.
(c) Fits may be improved by collecting data to shorter wavelengths.
(d) Programs may also be compiled and run on any computer with a FORTRAN 77 compiler.
(e) Theoretical curves can be constructed by summing the basis spectra multiplied by their fractional weights.
(f) Time required for the deconvolution of 17 data sets containing 83 data points each into five basis curves.

METHODS FOR ANALYZING CD DATA

Measurements of Helical Content at a Single Wavelength

Measurements at single wavelengths are useful to follow the kinetics and thermodynamics of the folding of polypeptides and proteins. Representative equations for calculating helical content from the ellipticity at 222 nm (33) and at 208 nm (15) have been described. The major advantage of using single wavelengths is that data can be collected rapidly. The disadvantage is that the information content of measurements at a single wavelength is limited, and other conformations such as β-sheet and turns, as well as aromatic chromophores (1, 34), may interfere with the estimation of α-helical content.

Estimating the Secondary Structure of Proteins and Polypeptides from CD Spectra by Multilinear Regression

The simplest methods of analysis of protein secondary structure from CD spectra fit the data to be analyzed to the spectra of standards by the method of least squares (multiple linear regression). In the earliest work, the spectra of polypeptides in known conformations were used as standards (15, 19). Later, when the conformation of a large number of proteins had been determined from X-ray crystallographic analysis, the CD spectra of these proteins were deconvoluted into basis spectra for the α-helix, antiparallel and parallel β-pleated sheets, β-turn, and random conformations by multiple linear regression analysis, and these extracted curves were used as standards (3, 16–18, 35). In constrained least-squares fits (15) the sum of Fi must equal 1 (100%). In nonconstrained fits (19), the coefficients may be normalized to 100% after the fit is obtained. Two computer programs for performing constrained least-squares analysis (the LINCOMB and G&F programs) and one for performing a nonconstrained analysis (the MLR program) are available on a diskette (see the Appendix).

Nonconstrained least-squares analysis (MLR). Nonconstrained least-squares analysis, i.e., multilinear regression, is the only method which can be used to estimate conformation when protein concentration is not known precisely. Using the spectra of the polypeptide models suggested by Brahms and Brahms (19) as standards, the method gives a reasonable estimate of α-helical content, and there is some correlation between the calculated and found amount of β-structure, but the estimate of β-turns is very poor. The method is adequate, however, to indicate whether organic solvents, membranes, or ligands increase or decrease the helicity of a peptide or protein.

Constrained least-squares analysis (G&F and LINCOMB). Constraining the sum of the fractional weights to equal 1 improves the estimate of β-sheet and β-turns when the method of least squares is employed. Use of the polypeptide standard curves of Brahms and Brahms (19) appears to give better estimates of protein secondary structure than use of standard curves extracted from the protein data bases (3, 18). There are, however, several drawbacks to using the simple method of least squares, which analyzes the CD at each wavelength independently, to extract reference curves from a collection of protein spectra. First, the contributions of aromatic groups and structures other than α-helix, β-sheets, turns, and random coils are ignored. Second, in several wavelength regions, the spectra of the various conformations are not well distinguished from one another, and this can lead to large errors in the deconvolution of the spectra into standard basis curves. In addition, at least four types of β-turns have been identified, and several of these have spectra which are very different from one another (19, 36, 37). Thus, it is simplistic to try to extract the spectrum of a generic turn from a data base using the method of least squares.

The estimates of protein structure obtained using the least-squares programs are not as good as those determined by more modern methods (see Table 1). The programs, however, do have their uses. First, they all output the calculated and experimental curves graphically, so that one can directly observe whether the calculated fit is a good match to the raw data. In addition, they may be used when data are obtained over a fairly limited wavelength range (240–200 nm) with only slight loss of accuracy.
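The two least-squares variants described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the three "reference spectra" are hypothetical Gaussian bands, not measured standards, the function names are invented here, and the constraint is imposed with a Lagrange multiplier (a KKT linear system), which is one convenient way to enforce a unit sum and not necessarily how the LINCOMB, G&F, or MLR programs are implemented.

```python
import numpy as np

# Hypothetical stand-ins for reference spectra: three Gaussian bands on a
# 240-200 nm grid. Real analyses use measured polypeptide or protein spectra.
wavelengths = np.arange(200.0, 241.0)  # nm, 1-nm steps


def gaussian_band(center, width, amplitude):
    return amplitude * np.exp(-0.5 * ((wavelengths - center) / width) ** 2)


# Columns of S play the role of S_lambda,i in Eq. [1].
S = np.column_stack([
    gaussian_band(208.0, 6.0, -30.0),
    gaussian_band(218.0, 8.0, -20.0),
    gaussian_band(230.0, 5.0, 10.0),
])


def fit_nonconstrained(S, theta):
    """Ordinary least squares: minimize ||S F - theta|| with no condition on sum(F)."""
    F, *_ = np.linalg.lstsq(S, theta, rcond=None)
    return F


def fit_constrained(S, theta):
    """Least squares subject to sum(F) == 1, solved through the KKT linear system."""
    n = S.shape[1]
    kkt = np.zeros((n + 1, n + 1))
    kkt[:n, :n] = 2.0 * S.T @ S
    kkt[:n, n] = 1.0   # Lagrange-multiplier column
    kkt[n, :n] = 1.0   # constraint row: sum(F) = 1
    rhs = np.append(2.0 * S.T @ theta, 1.0)
    return np.linalg.solve(kkt, rhs)[:n]


F_true = np.array([0.5, 0.3, 0.2])
theta = S @ F_true                      # noise-free synthetic spectrum

# A 20% concentration error rescales the whole spectrum.
theta_scaled = 0.8 * theta
F_raw = fit_nonconstrained(S, theta_scaled)
F_normalized = F_raw / F_raw.sum()      # normalize to 100% after the fit

F_constrained = fit_constrained(S, theta)
```

In this noise-free toy case the nonconstrained coefficients simply scale with the concentration error and are recovered after normalization, which is the property that makes the nonconstrained fit usable when the concentration is uncertain. The constrained fit assumes the spectrum is correctly scaled.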
The use of polypeptide standards has the benefit that the fits obtained are not biased by the choice of proteins or by the methods used to translate X-ray coordinates into secondary structures.

Singular Value Decomposition (SVD)

Hennessey and Johnson (20) suggested that SVD would be a better method of extracting information from a data base of protein CD spectra than the method of least squares. SVD is an eigenvector method of multicomponent analysis, which may be used to extract orthogonal basis curves from a set of spectra. After deconvolution, each basis curve, which has a unique shape, is related to a known mixture of secondary structures. The basis spectra are then used to analyze the conformation of unknown proteins. In SVD the sum of the fractional weights of each conformation is not constrained to equal 1. To have enough information to be used for conformational analysis, each basis spectrum must have unique maxima and minima, called nodes. To ensure that the spectra have sufficient nodes for successful deconvolution, Hennessey and Johnson (20) caution that the method cannot be used with a data set truncated above 184 nm. When the data set of 16 proteins plus poly-L-glutamate is analyzed by SVD, the fits, compared to those of multilinear regression, improve for α-helix, are unchanged for total β-structure, and are poorer for turns (see Table 1).

Convex Constraint Analysis

Perczel et al. (23–25) developed an algorithm called convex constraint analysis (CCA), which, like SVD, deconvolutes a data base of spectra into components, but has different criteria for defining the basis curves. In CCA, the sum of the fractional weights of each component spectrum is constrained to be equal to 1. In addition, a constraint called volume minimization is defined which allows a finite number of component curves to be extracted from a set of spectra without relying on spectral nodes. CCA does not use X-ray crystallographic data in the deconvolution procedure. Once the basis curves are obtained, they must be assigned to specific secondary structures by correlating the fractional weight of each basis spectrum with the fractional weight of each conformation of the known proteins in the data set that has been deconvoluted. When the test data set is deconvoluted into 6 curves using the CCA algorithm, two of the basis curves have spectra similar to the spectra of α-helical peptides, and their fractional weights must be added to estimate the total helical content. The estimate of total helix is very good and is independent of the wavelength range examined. The estimates of β-turns and β-pleated sheets, however, are poor compared to the other methods.

It is difficult to relate the basis spectra extracted from a protein data base by CCA to specific conformations because the secondary structures in a protein are not truly independent of one another. For example, for the 16 proteins in Table 1, the correlation coefficient of the fraction of antiparallel β-pleated sheets with the fraction of β-turns is 0.34, and that between the unordered fractions and β-turns is 0.42. Thus, it is almost impossible to decide whether a given basis spectrum corresponds to the β-turns, antiparallel β-sheet, or unordered conformations in the protein data set without additional outside information about the CD spectra of these conformations. While CCA is difficult to use for the a priori analysis of the secondary structure of unknown proteins, it is ideal for examining the spectra of proteins and polypeptides as a function of temperature, pH, or ligand binding. CCA easily determines the minimum number of CD spectra necessary to reconstruct all of the observed spectra and quantifies the fractional weight of each component spectrum in the data set.
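The SVD procedure described earlier in this section can be illustrated with a short NumPy sketch. The assumptions are considerable: the data base is synthetic (hypothetical Gaussian "pure" spectra mixed with random fractions), the function name `svd_calibration` is invented for this example, and the calibration matrix is built from a truncated pseudo-inverse, which is one common reading of the approach rather than a transcription of any published program.

```python
import numpy as np

rng = np.random.default_rng(0)
wavelengths = np.arange(178.0, 261.0)      # 83 points, 260-178 nm


def band(center, width, amplitude):
    return amplitude * np.exp(-0.5 * ((wavelengths - center) / width) ** 2)


# Hypothetical "pure" spectra for three structure classes (illustration only).
pure = np.column_stack([
    band(192.0, 6.0, 70.0) + band(208.0, 5.0, -30.0) + band(222.0, 7.0, -33.0),
    band(196.0, 6.0, 30.0) + band(217.0, 8.0, -18.0),
    band(198.0, 7.0, -35.0),
])

# Synthetic data base: 16 proteins with known fractions (columns sum to 1).
F_db = rng.dirichlet(np.ones(3), size=16).T    # shape (3 structures, 16 proteins)
C_db = pure @ F_db                             # shape (83 wavelengths, 16 proteins)


def svd_calibration(C, F, n_basis):
    """Return X such that F ~= X @ C, using only the n_basis largest singular values."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :n_basis], s[:n_basis], Vt[:n_basis]
    X = F @ Vt_k.T @ np.diag(1.0 / s_k) @ U_k.T
    return X, U_k                              # U_k holds the orthogonal basis curves


X, basis_curves = svd_calibration(C_db, F_db, n_basis=3)

# Analyze an "unknown" spectrum built from the same hypothetical pure spectra.
f_unknown = np.array([0.45, 0.25, 0.30])
estimate = X @ (pure @ f_unknown)
```

Nothing in the calculation forces the estimated fractions to sum to 1, which mirrors the statement in the text. With real, noisy data the number of basis curves retained becomes the critical choice.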

Selection Methods

The basis curves obtained from multilinear regression, singular value decomposition, or convex constraint analysis of a set of CD spectra may change greatly depending on the choice of proteins used as standards in the data base. This occurs because some proteins, whose conformations are known, may have unusual CD spectra due to aromatic amino acids, disulfide bridges, or rare conformations (38). To overcome these difficulties several authors suggested that selection procedures should be employed so that the only proteins used as standards have spectral characteristics similar to those of the unknown protein whose conformation is to be evaluated. Methods which utilize various selection procedures include ridge regression, variable selection, and neural networks.

Ridge regression analysis (CONTIN). Provencher and Glöckner (22) proposed that the CD spectra of unknown proteins could be fit directly by a linear combination of the spectra of a large data base of proteins with known conformations. They developed a computer program, called CONTIN, which uses a variation of the method of least squares that is similar to a mathematical technique known as ridge regression. In their method, the contribution of each reference spectrum to the spectrum to be analyzed is kept small, unless it contributes to a good agreement between the theoretical best-fit curve and the raw data. The CONTIN program gives a much better estimate of β-turns than simple multiple linear regression, SVD, or CCA (see Table 1), and truncating the data at 200 nm appears to have little effect on its prediction of protein conformation. The method still suffers, however, in that the fits depend on the choice of proteins in the data base of standards. Venyaminov et al. (39) suggest that estimates of conformational classes can be improved by including denatured proteins in the data base as references for the random conformation.

Singular value decomposition with variable selection (the VARSLC program). Manavalan and Johnson (21) showed that the technique of variable selection can significantly improve the estimate of protein conformation when combined with singular value decomposition (see Table 1). In the variable selection method (the VARSLC program), an initial data base of proteins with known spectra and secondary structures is selected. Some of the protein spectra are eliminated systematically to create new data bases with a smaller number of standards. Singular value decomposition is used on all of the reduced data sets to evaluate the conformation of the unknown protein. The results obtained using each set are then examined, and the ones fulfilling selection criteria for a good fit are averaged. The VARSLC program gives an excellent evaluation of protein conformation in solution. Its major disadvantage is that it is not recommended for use unless data can be collected to at least 184 nm (6, 9). In addition, the program is relatively slow, since all of the combinations must be tested individually.

The self-consistent method (SELCON). Sreerama and Woody (8, 29, 30) have modified the variable selection method to improve its speed and accuracy; they call the result the self-consistent method (SELCON). In the SELCON program, first the proteins in the data base are arranged in order of increasing root-mean-square difference from the CD spectrum to be analyzed, and the spectra which are least like the spectrum of interest are systematically deleted as described by van Stokkum et al. (40). This increases the speed of finding the best solutions. Second, the program utilizes the observation that prediction improves when the protein analyzed is included in the basis set, since the solution is biased toward the test protein structure. An initial guess of the structure of the protein to be analyzed is made, and this conformation is included in the data base which is deconvoluted using SVD. The secondary structure of the protein is then determined. The solution replaces the initial guess, and the process is repeated until self-consistency is attained. The SELCON program gives very good estimates of α-helix, β-structure, and β-turns of globular proteins and appears to work fairly well even when data are available only between 240 and 200 nm. Recently, Sreerama and Woody (29) modified the SELCON program so it can also determine the contribution of the P2 conformation to the spectra of globular proteins.

There is a caveat in using the SELCON program. The program, with its current data base of reference spectra, does a relatively poor job of predicting the structure of polypeptides with very high contents of β-pleated sheet; it overestimates α-helix and underestimates β-sheet considerably.
The errors may arise because the magnitude of the ellipticities of pure infinite β-pleated sheets found in polypeptides and some protein aggregates is much higher than the ellipticities of the short β-sheets found in the globular proteins used as standards in the data base.

Neural nets (K2D). A neural network is a computer program which can detect patterns and correlations in data. Böhm et al. (26) first proposed that neural networks could be used to analyze CD and that the use of such computational techniques could significantly improve the correlation between calculated and observed secondary structures. The application of neural networks to biochemical problems has been reviewed by Hirst and Sternberg (41). In neural networks there are three kinds of units: input units, which receive signals from external sources and send signals to other units; output units, which receive signals from other units and send signals to the environment; and hidden units, which receive inputs from other units and send output signals to other units, but do not directly receive data or output final results. A neural network is formed by organizing units into layers. There can be connections between units in the same layer and connections between units in different layers. The units are connected together by neurons, and the connections are numerically weighted so that the data used as input will result in the correct output. In the case of CD, the input patterns are the CD spectra and the output patterns are the fractional weights of the secondary structures.

In neural network analysis there are two phases, the learning or training phase and the recall phase. In the learning phase connections are made between the points of the CD spectra and the secondary structure of standards, and the weights of the connections are adjusted until the error between the calculated and actual secondary structures is minimized. In the recall phase, data not used in the learning phase are input and the corresponding output is calculated using the adjusted weights. In neural net analysis the learning phase can take many hours, but the recall phase takes seconds. Commercial software packages are available for performing neural net analyses (see Böhm et al. (26) and Sreerama and Woody (30) for sources).

The neural network of Böhm et al. (26) consisted of 83 units in the input layer (corresponding to CD at 83 wavelengths between 260 and 178 nm), a hidden layer with 45 neurons, and an output layer with 5 neurons representing the α-helix, antiparallel and parallel β-sheets, β-turns, and remainder. They found excellent prediction of α-helix and antiparallel β-sheet, with correlation coefficients of 1.0 and 0.91, respectively. When the wavelength region was truncated to between 250 and 200 nm, however, the prediction of α-helix remained excellent, but the prediction of β-sheet gave a negative correlation coefficient.
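The recall phase for a network with the layer sizes quoted above (83 input units, 45 hidden units, 5 output units) amounts to two matrix multiplications. The sketch below is purely illustrative: the weights are random and untrained, the sigmoid activation and the names `W1`, `W2`, and `recall` are assumptions made for this example, and the input is a placeholder vector, not a measured spectrum.

```python
import numpy as np

rng = np.random.default_rng(1)
N_INPUT, N_HIDDEN, N_OUTPUT = 83, 45, 5   # layer sizes quoted in the text


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Untrained, randomly initialized weights; the learning phase would adjust these.
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUT))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=(N_OUTPUT, N_HIDDEN))
b2 = np.zeros(N_OUTPUT)


def recall(spectrum):
    """Recall phase: propagate one 83-point spectrum through the fixed weights."""
    hidden = sigmoid(W1 @ spectrum + b1)
    return sigmoid(W2 @ hidden + b2)


spectrum = rng.normal(size=N_INPUT)        # placeholder for a scaled CD spectrum
fractions = recall(spectrum)               # five outputs, one per structure class
```

Because recall involves only these fixed-weight operations, it takes a negligible amount of time compared with the iterative weight adjustment of the learning phase.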
Sreerama and Woody (30) analyzed their test proteins in a manner similar to that described by Bohm et al. (26). Their best results were obtained using two hidden layers, and these results are summarized in Table 1. They found that prediction could be improved if variable selection was used in constructing the network. However, such an approach is probably not for the average user of CD, since the calculations are very time consuming and require software which is not yet generally available.
Recently a somewhat different neural network procedure for analyzing CD data, called proteinotopic mapping, has been described (27, 28). The computer program that implements this method is named K2D. It consists of a data base of weights and a recall program for determining α-helix and β-structure based on these weights. The program uses only data obtained between 240 and 200 nm as input and gives the best estimates of β-sheet when only data with a limited wavelength range are available. It is rapid to use, and it has the advantage that it outputs the theoretical curve, which can be compared with the raw data, but it does not evaluate β-turns.
The Estimation of Protein Tertiary Structure Class from Circular Dichroism Data
While CD is most often used to determine secondary structural characteristics, it has been suggested (42, 43) that it can be of some use in determining some elements of tertiary structure as well. Proteins have been divided into five structural classes on the basis of their secondary structure (42, 44): all-α (mainly α-helical), all-β (mainly β-pleated sheet), α + β (separate α-helix and β-sheet regions), α/β (intermixed α-helices and β-sheet regions), and random (predominantly unordered). Manavalan and Johnson (42) suggested that it should be possible to identify the structural class of a protein by visual inspection of its CD spectrum. They found that the all-α, α + β, and α/β proteins show pronounced negative CD bands at 222 and 208 nm and a positive band between 190 and 195 nm. They suggested that the all-α proteins could be distinguished from those containing some β-structure by the wavelength at which the CD changed from positive to negative below 180 nm. In all-α proteins the crossover is not until 172 nm, while it occurs at higher wavelengths in those with some β-structure. In addition, they suggested that α + β proteins could be distinguished from α/β proteins by the relative ratios of their bands at 222 and 208 nm. In the α + β type the 208-nm band is larger than the 222-nm band, while the relative intensities are reversed in α/β proteins. All-β proteins lack the characteristic α-helical peaks and can be divided into two types: those in which the spectrum resembled model β-polypeptides and those whose spectra resembled disordered polypeptides.
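The visual rules of Manavalan and Johnson (42) summarized above can be written down as a small decision function. This is only a sketch of the rules as stated in the text. The function name, the use of a simple sign test for "pronounced negative bands", and the 1-nm tolerance on the 172-nm crossover are assumptions introduced here, not values from the original work.

```python
def guess_structural_class(theta_208, theta_222, crossover_nm=None, tolerance_nm=1.0):
    """Rough structural-class guess from the ellipticities at 208 and 222 nm
    and, if measured, the positive-to-negative crossover below 180 nm.

    The sign test and tolerance_nm are illustrative assumptions.
    """
    # No negative double minimum at 208/222 nm: not a helix-containing class.
    if theta_208 >= 0 or theta_222 >= 0:
        return "all-beta or unordered"
    # All-alpha proteins cross over at about 172 nm; proteins with some
    # beta-structure cross over at higher wavelengths.
    if crossover_nm is not None and crossover_nm <= 172.0 + tolerance_nm:
        return "all-alpha"
    # alpha+beta: the 208-nm band is larger than the 222-nm band;
    # alpha/beta: the relative intensities are reversed.
    if abs(theta_208) > abs(theta_222):
        return "alpha+beta"
    return "alpha/beta"

print(guess_structural_class(-12000.0, -10000.0, crossover_nm=176.0))
print(guess_structural_class(-10000.0, -12000.0, crossover_nm=176.0))
print(guess_structural_class(-20000.0, -21000.0, crossover_nm=172.0))
```

Without a far-UV crossover measurement the function cannot separate all-α proteins from the mixed classes, which mirrors the dependence of the visual method on data collected below 180 nm.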
Recently, Venyaminov and Vassilenko (43) have proposed that the mathematical technique of cluster analysis can be used to assign the class of a protein from its CD spectrum between 236 and 190 nm. A computer program to perform cluster analysis, DEF_CLAS.EXE, is available from Dr. Venyaminov (see Appendix). When tested on 53 proteins (43) the program gave 100% accuracy in identifying all-α, α/β, and denatured proteins, 85% accuracy for identifying α + β proteins, and 75% accuracy for identifying all-β proteins. It should be noted, however, that the program performs poorly when tested on polypeptides which are 100% α-helical or 100% β-sheet, identifying them both as belonging to the α/β class.
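The general idea of assigning a spectrum to a class by its closeness to groups of reference spectra can be illustrated with a nearest-centroid rule. This is not the algorithm of DEF_CLAS.EXE, whose details are not given here. The Euclidean distance, the use of class mean spectra, and the three-point toy "spectra" are all assumptions chosen to keep the sketch short.

```python
import math

def centroid(spectra):
    """Mean spectrum of a class, computed wavelength by wavelength."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]

def distance(a, b):
    """Euclidean distance between two spectra sampled at the same wavelengths."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_class(spectrum, reference_sets):
    """Return the class whose mean spectrum is closest, and that distance."""
    centroids = {name: centroid(members) for name, members in reference_sets.items()}
    best = min(centroids, key=lambda name: distance(spectrum, centroids[name]))
    return best, distance(spectrum, centroids[best])

# Toy reference data (made-up numbers, three wavelengths per spectrum).
reference_sets = {
    "all-alpha": [[-30.0, -32.0, 70.0], [-28.0, -30.0, 60.0]],
    "all-beta": [[-15.0, -8.0, 25.0], [-17.0, -10.0, 35.0]],
}
label, dist = assign_class([-27.0, -29.0, 58.0], reference_sets)
print(label, dist)
```

A rule of this kind always returns some class, even for a spectrum far from every reference group, so the returned distance should be inspected before trusting the label.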
SUMMARY AND RECOMMENDATIONS

All of the various methods of determining protein conformation from CD spectra give a reasonable estimate of helical content. The SELCON program (8, 29, 30), using a data base of 17 references as standards, appears to give very good correlations between predicted and found α-helix, β-sheet, and β-turn for globular proteins. It is reasonably fast and is easy to use. Its use is highly recommended for the estimation of the structure of globular proteins in solution. However, it does a relatively poor job of estimating the structure of polypeptides with very high contents of β-pleated sheet (see above). The K2D program (27, 28) gives a very good estimate of β-structure with data collected only between 240 and 200 nm and works well with polypeptides, but it does not estimate β-turns. The CONTIN (22) program, on the other hand, gives a good estimate of β-turns. It is suggested that both these programs be used in conjunction with the SELCON program for the best overall estimates of protein and polypeptide conformation.
It should be emphasized that when using any of the methods to analyze CD data, the program which gives the calculated spectrum that best matches the experimental spectrum does not necessarily provide the best estimate of protein conformation. For example, the CONTIN program almost always gives excellent agreement between the experimental and calculated CD curves, even when the fits are relatively poor compared to other methods. On the other hand, the calculated curves obtained using the K2D program are often very poor matches to the experimental data, although the predictions of structure may be very good. When different methods of estimating protein structure give widely varying results, the estimates of conformation should be regarded with suspicion.
When protein concentration is not known precisely, only a nonconstrained least-squares analysis program, such as the MLR program, may be used to analyze protein secondary structure (19). This method gives inferior estimates, however, compared to all of the other methods.
Although it appears to be less suitable for the routine determination of the secondary structure of proteins than SELCON or CONTIN, the CCA (23–25) program is an excellent method of deconvoluting sets of CD spectra, e.g., to follow the effects of denaturants, ligands, or changes in temperature on protein and peptide conformation.
ACKNOWLEDGMENTS
I am heavily indebted to Robert W. Woody, who provided the FORTRAN source code and compiled copies of the SELCON program and preprints of his recent articles describing both the SELCON method and applications of neural networks to the evaluation of circular dichroism spectra. He also provided an early version of the program, SELCON2, which evaluates the P2 conformation in proteins. W. Curtis Johnson Jr. graciously provided the FORTRAN source code and compiled versions of his VARSLC program. He has also freely made available circular dichroism spectra of a wide variety of proteins which have been useful as standards, both for the work reported in this review and for the work of countless other researchers. Gerald D. Fasman generously provided the compiled version of the CCA and LINCOMB programs, which have been of invaluable use. He also provided the C source code and a new software analysis package with both the LINCOMB and CCA programs, which has a more sophisticated user interface currently being developed by Dr. A. Perczel. Dr. Sergei Yu. Venyaminov generously provided a compiled version of the CONTIN program which runs on personal computers. Dr. Miguel Andrade unselfishly provided the C source code and compiled code and the weight table for the K2D neural network program. Lawrence E. Greenfield assisted greatly in the compiling of the FORTRAN programs. I also thank Sarah E. Hitchcock-DeGregori and Barbara Brodsky for their support and critical reading of the manuscript. This work was supported by NIH Grants GM36326 and HL35726 to SEHD and by the CD facility at UMDNJ.

APPENDIX

Computer Programs to Analyze Protein Conformation from CD Data
The computer programs to analyze CD data described in this review are all available on diskette from N.J.G. upon request. Included are directions for using each program and programs to convert data files obtained on AVIV or JASCO spectrophotometers, or ASCII files of ellipticity as a function of wavelength, to the formats used by each of the programs. The following analysis programs are available:
The LINCOMB program, supplied by Gerald D. Fasman, Graduate Department of Biochemistry, Brandeis University (Waltham, MA), analyzes the CD of unknown spectra by fitting the spectra to that of standards by the constrained method of least squares (see Perczel et al. (25)). Five data sets are currently available as standards. FASMAN.DAT contains the original basis spectra of Perczel et al. (25) extracted by convex constraint analysis of 23 proteins, which was supplied by Dr. Fasman with the LINCOMB program. BRAHMS.DAT contains the polypeptide reference spectra of Brahms and Brahms (19), and YANG.DAT contains the basis spectra of Yang et al. (3) extracted from 15 proteins by multilinear regression. T&J33.DAT and S&R17.DAT contain reference basis spectra extracted by multilinear regression (3) from data sets of the CD spectra of 33 and 17 proteins, supplied by W. Curtis Johnson (9) and Robert W. Woody (8), respectively. The LINCOMB program runs on 80286 computers and higher with a color graphics card. When the LINCOMB program is used with the YANG.DAT data base, the best fit obtained, where all of the fractional weights are positive, is identical to the fit obtained using the ESTIMATE program of Yang et al. (3). The PROSEC program, supplied with the AVIV CD spectrophotometer, is a version of the ESTIMATE program.
MLR is a program that analyzes CD data by nonconstrained multilinear regression. It uses the same file formats and standards as the LINCOMB program (see above). The program runs on all IBM-compatible computers which have a color graphics card.
G&F calculates the percentage α-helix and β-structure by the original method of Greenfield and Fasman (15) using poly-L-lysine as a reference. This program runs on all IBM PC-compatible computers. A color graphics card is necessary to view the graphs.
Two programs which perform the self-consistent method of analyzing protein CD spectra of Sreerama and Woody (8, 29, 30) were contributed by Dr. R. W. Woody, Department of Biochemistry and Molecular Biology, Colorado State University (Fort Collins, CO). SELCON (8) evaluates up to five conformations: α-helix, antiparallel and parallel β-sheets, turns, and remainder. SELCON2 (29) evaluates α-helix, total β-structure, turns, P2, and remainder. An 80386 or higher computer with a math coprocessor is recommended for these programs. The FORTRAN codes, which are also supplied, can be compiled and run on any computer with an F77 compiler.
The VARSLC program is an implementation of the variable selection method of Manavalan and Johnson (21), which was contributed by W. Curtis Johnson, Department of Biochemistry and Biophysics, Oregon State University (Corvallis, OR). A data base of 33 proteins with ellipticities between 260 and 178 nm was also supplied by Dr. Johnson for use as standards with the program. An 80386 or higher computer with a math coprocessor is recommended for this program. The FORTRAN code, which is also available, can be compiled and run on any computer with an F77 compiler. When no proteins are excluded from the data base (i.e., the number of combinations is set to 1), the VARSLC program performs the simple SVD procedure described by Hennessey and Johnson (20).
CCAFAST and CCASLOW are versions of the convex constraint analysis program of Perczel et al. (24, 25), which were also supplied by Dr. Gerald D. Fasman. CCAFAST requires a math coprocessor. CCASLOW runs on all PCs but is very slow. An 80286 or higher computer with a math coprocessor is recommended for this program.
The CONTIN program, which performs the ridge regression technique of Provencher and Glockner (22, 38, 39), was contributed by Dr. S. Yu. Venyaminov, Department of Biochemistry and Molecular Biology, Mayo Foundation (Rochester, MN). This program works on 80286 or higher machines with a math coprocessor. A CD analysis package called CDSTRUC is available from Dr. Venyaminov. In addition to the CONTIN program it contains the ESTIMATE program of Yang et al. (3), a version of the VARSLC program of Manavalan and Johnson (21), and the cluster analysis program called DEF_CLAS.EXE (43). The conversion programs that convert raw CD data to the CONTIN format will work with all of the programs in the package.

K2D is the neural net recall program of Andrade et al. (27). It was contributed by Miguel Andrade, EMBL, Heidelberg, Germany. The program is also available on the World Wide Web at http://www.embl-heidelberg.de/ andrade/k2d.html. The program PLOTK2D will display the output file graphically after the program is run. An 80386 or higher PC with a math coprocessor is recommended for use with the program. A color graphics card is necessary to view the output with the PLOTK2D program.
Computer text files with instructions for estimating helical content from the ellipticity at 222 (33) and 208 nm (15) and for estimating protein concentration (11–14) are also included on the diskettes.
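The nonconstrained multilinear regression performed by programs such as MLR amounts to an ordinary least-squares fit of the unknown spectrum to a set of reference spectra. The sketch below shows that calculation in general form and does not reproduce the MLR program or its file formats. The function names and toy basis spectra are hypothetical, and rescaling the fitted weights to sum to 1 when the concentration is uncertain is an assumption of this sketch.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("reference spectra are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_weights(spectrum, basis):
    """Unconstrained least squares via the normal equations:
    minimize the squared difference between spectrum and sum_k w_k * basis[k]."""
    k = len(basis)
    gram = [[sum(p * q for p, q in zip(basis[i], basis[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(p * s for p, s in zip(basis[i], spectrum)) for i in range(k)]
    return solve_linear(gram, rhs)

def normalize(weights):
    """Rescale weights to sum to 1 (assumed convention for unknown concentration)."""
    total = sum(weights)
    return [w / total for w in weights]

# Toy basis spectra at four wavelengths (made-up numbers).
helix = [-30.0, -33.0, 40.0, 70.0]
sheet = [-18.0, -10.0, 5.0, 30.0]
other = [2.0, -5.0, -20.0, -40.0]
# Unknown: 50% helix, 30% sheet, 20% other, with a 20% concentration error.
unknown = [0.8 * (0.5 * h + 0.3 * s + 0.2 * o) for h, s, o in zip(helix, sheet, other)]

raw = fit_weights(unknown, [helix, sheet, other])
fractions = normalize(raw)
print(raw, fractions)
```

In this noise-free toy case the raw weights absorb the concentration error and the normalized fractions recover the composition. Nothing in the fit prevents negative weights, which is one difference between this approach and a constrained fit.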
REFERENCES
1. Adler, A. J., Greenfield, N. J., and Fasman, G. D. (1972) Methods Enzymol. 27, 675–735.
2. Woody, R. W. (1985) Peptides 7, 15–114.
3. Yang, J. T., Wu, C-S. C., and Martinez, H. M. (1986) Methods Enzymol. 130, 208–269.
4. Johnson, W. C., Jr. (1985) Methods Biochem. Anal. 31, 61–163.
5. Johnson, W. C., Jr. (1988) Annu. Rev. Biophys. Chem. 17, 145–166.
6. Johnson, W. C., Jr. (1990) Proteins Struct. Funct. Genet. 7, 205–214.
7. Woody, R. W. (1992) Adv. Biophys. Chem. 2, 37–79.
8. Sreerama, N., and Woody, R. W. (1993) Anal. Biochem. 209, 32–44.
9. Tourmadje, A., Alcorn, S. W., and Johnson, W. C., Jr. (1992) Anal. Biochem. 200, 321–331.
10. Bradford, M. M. (1976) Anal. Biochem. 72, 248–254.
11. Goa, J. (1953) Scand. J. Clin. Lab. Invest. 5, 218–222.
12. Lang, C. A. (1958) Anal. Chem. 30, 1692–1694.
13. Edelhoch, H. (1967) Biochemistry 6, 1948–1954.
14. Gill, S. C., and von Hipple, P. H. (1989) Anal. Biochem. 182, 319–326.
15. Greenfield, N., and Fasman, G. D. (1969) Biochemistry 8, 4108–4116.
16. Saxena, V. P., and Wetlaufer, D. B. (1971) Proc. Natl. Acad. Sci. USA 68, 969–972.
17. Chen, Y-H., and Yang, J. T. (1971) Biochem. Biophys. Res. Commun. 44, 1285–1291.
18. Chang, C. T., Wu, C-S. C., and Yang, J. T. (1978) Anal. Biochem. 91, 13–31.
19. Brahms, S., and Brahms, J. (1980) J. Mol. Biol. 138, 149–178.
20. Hennessey, J. P., and Johnson, W. C., Jr. (1981) Biochemistry 20, 1085–1094.
21. Manavalan, P., and Johnson, W. C., Jr. (1987) Anal. Biochem. 167, 76–85.
22. Provencher, S. W., and Glockner, J. (1981) Biochemistry 20, 33–37.
23. Perczel, A., Hollosi, M., Tusnady, G., and Fasman, G. D. (1991) Protein Eng. 4, 669–679.
24. Perczel, A., Park, K., and Fasman, G. D. (1992) Proteins Struct. Funct. Genet. 13, 57–69.
25. Perczel, A., Park, K., and Fasman, G. D. (1992) Anal. Biochem. 203, 83–93.
26. Bohm, G., Muhr, R., and Jaenicke, R. (1992) Protein Eng. 5, 191–195.
27. Andrade, M. A., Chacon, P., Merolo, J. J., and Moran, F. (1993) Protein Eng. 6, 383–390.
28. Merolo, J. J., Andrade, M. A., Prieto, A., and Moran, F. (1994) Neurocomputing 6, 443–454.
29. Sreerama, N., and Woody, R. W. (1994) Biochemistry 33, 10022–10025.
30. Sreerama, N., and Woody, R. W. (1994) J. Mol. Biol. 242, 497–507.
31. Kabsch, W., and Sander, C. (1983) Biopolymers 22, 2577–2637.
32. Levitt, M., and Greer, J. (1977) J. Mol. Biol. 114, 181–293.
33. Scholtz, J. M., Qian, H., York, E. J., Stewart, J. M., and Baldwin, R. L. (1991) Biopolymers 31, 1463–1470.
34. Chakrabartty, A., Kortemme, T., Padmanabhan, S., and Baldwin, R. L. (1993) Biochemistry 32, 5560–5565.
35. Bolotina, I. A., and Lugauskas, V. Y. (1986) Mol. Biol. 19, 1154–1166.
36. Woody, R. W. (1974) in Peptides, Polypeptides and Proteins (Blout, E. R., Bovey, F. A., Goodman, M., and Lotan, N., Eds.), pp. 338–360, Wiley, New York.
37. Perczel, A., and Fasman, G. D. (1992) Protein Sci. 1, 378–395.
38. Venyaminov, S. Y., Baikalov, K. A., Wu, C-S. C., and Yang, J. T. (1991) Anal. Biochem. 198, 250–255.
39. Venyaminov, S. Y., Baikalov, I. A., Shen, Z. M., Wu, C-S. C., and Yang, J. T. (1993) Anal. Biochem. 214, 17–24.
40. van Stokkum, I. H. M., Spoelder, H. J. W., Bloemendal, M., van Grondelle, R., and Groen, F. C. A. (1990) Anal. Biochem. 191, 110–118.
41. Hirst, J. D., and Sternberg, M. J. E. (1992) Biochemistry 31, 7211–7218.
42. Manavalan, P., and Johnson, W. C., Jr. (1983) Nature 305, 831–832.
43. Venyaminov, S. Y., and Vassilenko, K. S. (1994) Anal. Biochem. 222, 176–184.
44. Levitt, M., and Chothia, C. (1976) Nature 261, 552–558.
