
DESIGNING PEPTIDES WITH WEIGHTED FINITE-STATE TRANSDUCERS
CHRISTOPHER WHELAN
OREGON HEALTH & SCIENCE UNIVERSITY

Abstract. The discovery of novel peptides with useful capabilities or characteristics could lead to significant advances in fields such as materials science, nanotechnology, and medicine. However, the large size of the sequence search space, combined with the time required to experimentally test or simulate peptide behavior at the molecular level, makes statistical computational approaches attractive. We present a novel method for designing peptides based on sequence analysis, and apply it to two problems in peptide design: inorganic binding peptides and antimicrobial peptides. Peptides with the ability to bind to inorganic materials have many potential applications including medical devices, nanotechnology, and bone and tooth regeneration. Antimicrobial peptides have attracted attention as a potential source of therapeutic agents due to the rise of microbes resistant to traditional antibiotics. To design these peptides, we train a support vector machine classifier that discriminates between positive and negative sequences based on the counts of n-grams of amino acid chemical classes. Using the model learned by the classifier, we then build weighted finite-state transducers that we can sample or search for novel sequences sharing the characteristics of the positive training examples. We used this framework to produce a set of putative inorganic binding peptides, which we are testing experimentally. We also generated novel antimicrobial peptide sequences and used third-party prediction services to validate them, with strong initial results. We believe that our framework is flexible and generally applicable to many problems in peptide design.
1. INTRODUCTION

1.1. Peptide Design Applications. Designed peptides have potential uses in a wide variety of applications, from medicine to materials science and nanotechnology. However, the design of new peptides is made difficult by the large search space and incomplete chemical knowledge. Each position in a peptide sequence can potentially hold any of the 20 standard amino acid residues. Therefore, the search space grows exponentially as the length of the sequence increases, making an exhaustive search impossible; for a sequence of length 30, there are $20^{30} \approx 10^{39}$ possible sequences. This makes it difficult to search for novel peptides with processes that involve experimental testing or computationally expensive methods such as molecular dynamics simulations. In addition, it is still impossible to accurately predict the complete structure of a peptide given its sequence. In applications in which researchers do not have an experimentally verified three-dimensional structure of a known peptide to use as a template, statistical and machine learning approaches could be very useful in the peptide design process. We have built such a system for designing novel peptides and applied it to two problems in peptide design: inorganic binding and antimicrobial peptides.
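The size estimate above is simple arithmetic and can be checked directly. The following is only a minimal sketch; the function name `search_space_size` is an illustrative assumption and not part of the system described in this paper.

```python
import math

def search_space_size(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of the given length over the alphabet."""
    return alphabet_size ** length

# 20 standard amino acid residues, peptide length 30 (the example in the text).
n_sequences = search_space_size(30)
print(n_sequences)                         # exact integer count
print(round(math.log10(n_sequences), 2))   # base-10 exponent of the count
```

Because Python integers have arbitrary precision, the count is exact, and the logarithm gives the order of magnitude quoted in the text.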
1.2. Inorganic Binding Peptides. Peptides that are capable of binding inorganic materials such as metal or quartz have potential uses in a large number of applications. Many organisms found in nature incorporate inorganic materials into tissues such as bone, teeth, and shells. To build these structures, they use proteins and enzymes that direct their assembly at a nanoscale level [1]. Understanding and reverse engineering these processes could lead the way to many advances in engineering and medicine. For example, bone and tooth regeneration might be possible if we could recreate the chemical machinery that directs their formation. Other applications include the construction of nanoscale electronic and photonic devices. A first step towards these goals is the discovery and understanding of peptides capable of binding to inorganic substances. High-throughput combinatorial techniques such as phage display [2], in which a large number of bacteriophages are induced to express mutated peptides on their exteriors and then screened for ability to bind to a surface, have begun to provide data sets that can be used for statistical analysis of inorganic binding peptide sequences.
1.3. Antimicrobial Peptides. Antimicrobial peptides (AMPs) have attracted considerable attention as a potential new source of therapeutic agents effective against microorganisms that have developed resistance to traditional drugs [3]. Researchers have recently published several databases of AMPs, such as APD2 [4] and CAMP [5], allowing the creation of large data sets for the computational analysis of AMPs. These data sets can be used to learn patterns in AMP sequences, which researchers can then use to design novel AMPs. For example, Wang et al. [4] used the most common motifs in the APD2 database to hand-design a peptide that exhibited strong activity against E. coli.

1.4. Overview of the Paper. In Section 2, I give an overview of the necessary components of a computational peptide design system, summarize previous approaches, and introduce our approach. In Section 3, I discuss the details of our approach, and our application of it to both the inorganic binding peptide and antimicrobial peptide design problems. Section 4 contains preliminary results, including an assessment of the performance of the classifier used in our system, a description of the planned experimental verification of our generated inorganic binding peptides, and an assessment of designed AMP sequences using third-party computational prediction servers. I also examine the features identified by our models as most important for both design problems. In Section 5, I summarize our contribution and discuss future improvements.

2. PEPTIDE DESIGN APPROACHES

2.1. Introduction to Peptide Design Solutions. Computational peptide design approaches must have three components: a method for generating candidate sequences, a feature set to use to characterize sequences, and a method for scoring candidate sequences. The first component must produce a set of novel sequences, either randomly or using a probabilistic or rule-based system based on a model of the desired class of peptides. The system must then extract a set of features for each sequence for use in the final component, which is a scoring method that has been trained on known sequences. The complete system generates novel sequences, extracts their features, and scores them. Researchers then select the highest scoring candidates for experimental validation. Some approaches then use these experimental results to iteratively refine either their scoring model or their sequence generation model.
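The three components described above can be sketched as interchangeable functions. This is only a toy illustration of the modular structure: the uniformly random generator, the residue-count feature set, the linear scorer, and the weights in `TOY_WEIGHTS` are all assumptions made for the example, not any of the methods surveyed in this section.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def generate_random(n: int, length: int, rng: random.Random) -> List[str]:
    """Component 1 (toy version): uniformly random candidate sequences."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length)) for _ in range(n)]

def composition_features(seq: str) -> Dict[str, int]:
    """Component 2 (toy version): residue counts as the feature set."""
    return dict(Counter(seq))

def linear_score(features: Dict[str, int], weights: Dict[str, float]) -> float:
    """Component 3 (toy version): weighted sum of feature counts."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())

def design(
    candidates: List[str],
    featurize: Callable[[str], Dict[str, int]],
    score: Callable[[Dict[str, int]], float],
    top_k: int,
) -> List[Tuple[float, str]]:
    """Featurize and score every candidate, then keep the top_k by score."""
    scored = [(score(featurize(seq)), seq) for seq in candidates]
    scored.sort(reverse=True)
    return scored[:top_k]

# Hypothetical weights, chosen only to make the example concrete.
TOY_WEIGHTS = {"W": 1.0, "G": -0.5}
rng = random.Random(0)
candidates = generate_random(1000, 12, rng)
top = design(candidates, composition_features,
             lambda f: linear_score(f, TOY_WEIGHTS), top_k=5)
for score, seq in top:
    print(f"{score:+.1f}  {seq}")
```

Keeping the generator, the feature map, and the scorer behind separate functions mirrors the modular view taken in this section: any one of them can be replaced without touching the others.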
2.2. Past Approaches. Oren et al. [6] used a computational method to search for novel peptides that bind to inorganic materials. For candidate generation, they randomly generated 1,000,000 peptide sequences. Using a clustering approach, they scored candidates based on their distances from sets of known strong and weak binders, as measured by global pairwise alignments. They trained their scorer, in this case the substitution matrix used in the alignments, using a greedy stochastic search. Because the score was based on sequence alignment, the feature set used in this approach was the raw amino acid sequence. The model successfully designed several strong and weak binding peptides.

In another peptide design approach, a group of researchers at the University of British Columbia have built a scoring method for novel AMPs using artificial neural networks [7, 8]. The authors created feature sets using two-dimensional quantitative structure-activity relationship (QSAR) descriptors. QSAR descriptors quantify the chemical structure of a peptide based on the properties of amino acids in the sequence. To generate novel sequences for scoring, the authors sampled probability distributions that modeled the likelihood of a given amino acid appearing at a particular location in the sequence.
Loose et al. [9] generated novel AMPs using a linguistically inspired approach. To generate sequences, they built a set of regular grammars based on a database of known AMPs, and then exhaustively enumerated the language defined by those grammars. They then clustered candidate sequences and gave representatives from each cluster a score based on the number of known AMPs that shared a rule in the grammar with the candidate sequence.

Socolich et al. [10] built a scoring model for WW domains, which are proteins with a particular type of fold, based on coupling constraints between positions in sequences aligned using multiple sequence alignment (MSA). For an example of a coupling constraint, consider an MSA in which the amino acid in a given position is always polar when the amino acid in another fixed position is non-polar. The authors used a simulated annealing procedure to find sequences which minimized the difference between coupling constraints in the test set and the novel sequences, and experimentally verified several generated peptides. Thomas et al. [11] used the same feature set for this problem, but designed a more complex generation method. The authors trained a probabilistic graphical model (PGM) to learn the amino acid coupling constraints, which they encoded as edges in the model. To generate sequences from their PGM, the authors developed two sampling methods. They scored their sampled sequences based on their log likelihoods in the PGM.

The approaches described above are summarized in Table 1. This summary demonstrates that it is possible to characterize peptide design approaches based on their sequence generation method, scoring method, and chosen feature sets. Analyzing approaches in this way may help to differentiate the strengths and shortcomings of various approaches, and encourages a modular view of the problem. It may be possible in the future to combine candidate generation methods with scoring methods based on their results in similar peptide application domains. In the next section, we describe our approach, which uses a feature set and scoring method designed to take advantage of recent machine learning techniques for biological sequences, and incorporates a novel generation method that we believe will be more efficient at suggesting new peptide sequence candidates than the methods described above.

Target                   Generation Method     Scoring Method        Feature Set
Inorganic binding [6]    Random                Alignment score       Sequence
AMPs [7]                 AA distribution       ANN                   QSAR
AMPs [8]                 AA distribution       ANN                   QSAR
AMPs [9]                 Regular grammars      Grammar matching      Sequence
WW domains [10]          Simulated annealing   Residue correlation   MSA coupling
WW domains [11]          Sampling of PGM       PGM log likelihood    MSA coupling

Table 1. Selection of recent approaches for computational peptide design. Abbreviations are AMP: Antimicrobial peptide; AA: amino acid; ANN: Artificial neural networks; QSAR: Quantitative structure-activity relationship; MSA: Multiple Sequence Alignment; PGM: Probabilistic graphical model.

2.3. Overview of Proposed Solution. We propose a method for de novo peptide design based on learning from n-gram counts of classes of amino acid residues, and then using weighted finite-state transducers to produce sequences that include those features that are strongly associated with the desired class of peptides. Feature mappings based on n-gram counts are analogous to representations of sequences in the frequency domain [12], and have been used successfully in tasks such as protein remote homology detection [13], the problem of detecting similarities between proteins from different organisms. In our application, we attempt to learn a set of weights that describe how each n-gram feature is associated with the target peptide class, and then use those weights to generate new sequences. The latter task is made difficult, however, by the fact that most vectors of n-gram feature counts do not represent valid peptide sequences. For example, a feature vector that has a positive count for a trigram feature must also have positive
counts for the bigrams and unigrams contained within the trigram; otherwise, it cannot represent a valid sequence. If the weight associated with the trigram feature is high, but the weight associated with the bigram features is low or negative, one cannot simply increase the count of the trigram feature while decreasing the count of the bigram features. Weighted finite-state transducers (WFSTs) are a potential solution for this problem in sequence design. Commonly used in speech recognition and natural language processing [14, 15], they can be used to build a weighted lattice of sequences that can be searched or sampled using efficient algorithms to yield sequences rich in a desired set of n-gram features, while still being valid sequences.

3. METHODS

3.1. Construction of Inorganic Binding Peptides Data Set. In our initial training phase, we used the 39 training examples from Oren et al. [6] to train our scoring model. All sequences were of length 12. Oren et al. characterized these sequences according to their quartz binding affinity as either strong (10 sequences), moderate (14 sequences), or weak binders (15 sequences). We used the 10 strong binders as positive training examples and the 29 moderate and weak binders as negative training examples. We initially trained our model using these 39 examples, and then tested the scoring model on the 10 novel peptides generated by Oren et al. Because the data set was small, for the sequence generation phase of our approach we retrained on the original training set of 39 plus the two strongest and two weakest of the 10 novel sequences, for a total of 43 sequences: 12 positive examples and 31 negative examples.

3.2. Construction of Antimicrobial Peptides Data Set. We downloaded all experimentally verified AMPs from the CAMP database as of March 8, 2010, yielding a set of 1,187 peptide sequences. After removing all sequences that contained the nonstandard amino acid letters B, X, or Z, and then extracting representative sequences using the CD-HIT clustering server [16] with a sequence identity parameter of 0.9, we were left with a data set of 862 AMP sequences, with a mean length of 34 amino acid residues and a median length of 30. It is difficult to create a negative training set of experimentally verified non-AMPs, so we followed Thomas et al. [5] in noting that AMPs are generally secreted from cells, and downloaded a set of human protein sequences from the UniProt database that were between twenty and fifty amino acids in length, not annotated as antimicrobial, and not annotated as secreted. This gave us a set of 1,224 negative training examples. We randomly split the data, putting 70% in a training set and 30% in a test set.

3.3. Extracting Features and Training an SVM Classifier. To build a feature space and classifier for our peptide sequences, we define a set of 13 classes to represent the chemical properties of amino acid residues: acidic, cyclic, aliphatic, aromatic, basic, buried, charged, hydrophobic, large, medium, small, non-neutral, and polar. We define unigram, bigram, and trigram features to be the ordered set of classes contained in a subsequence of one, two, or three amino acids. For each peptide sequence, we count the number of times each feature appears, as shown in Figure 1. Given a set of positive and negative training examples, we compute vectors of feature counts and train an SVM using the SVMLight package [17] V6.02.

[Figure 1: diagram mapping the sequence LPDWPPPQLYH to the feature vector (f1=1, f2=0, f3=1, f4=1, ...), with f1: Hydrophobic, Polar, Aliphatic; f3: Acidic, Aromatic, Cyclic; f4: Buried, Hydrophobic; ...]

Figure 1. An example of how we map peptide sequences to features. The substring DWP contributes to the count of the trigram feature f3, which represents subsequences of classes Acidic, Aromatic, Cyclic.
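The feature mapping of Section 3.3 can be illustrated with a short sketch. Two things here are assumptions rather than details taken from the text: the class table `CLASSES` is a hypothetical, partial assignment covering only four residues, and the sketch assumes that a residue belonging to several classes contributes to every combination of classes for the n-grams it appears in.

```python
from collections import Counter
from itertools import product
from typing import Dict, FrozenSet, Tuple

# Hypothetical, partial class table (illustration only, not the paper's table).
CLASSES: Dict[str, FrozenSet[str]] = {
    "D": frozenset({"acidic", "charged", "polar"}),
    "W": frozenset({"aromatic", "hydrophobic", "large"}),
    "P": frozenset({"cyclic", "small"}),
    "L": frozenset({"aliphatic", "hydrophobic"}),
}

def ngram_class_counts(seq: str, max_n: int = 3) -> Counter:
    """Count unigram, bigram, and trigram features over residue classes.

    Each feature is a tuple of class names, one per position in the n-gram.
    Raises KeyError for residues missing from the (partial) class table.
    """
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(seq) - n + 1):
            window = seq[start:start + n]
            class_sets = [sorted(CLASSES[aa]) for aa in window]
            # Assumption: every combination of classes is counted once.
            for combo in product(*class_sets):
                counts[combo] += 1
    return counts

counts = ngram_class_counts("LPDWP")
print(counts[("acidic", "aromatic", "cyclic")])  # contribution of the trigram DWP
```

The resulting sparse count vector is the kind of input that a linear SVM package would be trained on; the training step itself is omitted here.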
[Figure 2: state diagrams. (a) Feature Machine: from state CW, input V leads through states CWV-f-1 to CWV-f-4, emitting f3, f6, and f8, to state WV; input R leads through states CWR-f-1 to CWR-f-4, emitting f5, f7, and f9, to state WR. (b) Scorer: arcs labeled f1/-4.401e-05, f2/-0.00479, f3/0.00515, f4/-0.00492, f5/0.00492.]

Figure 2. Portions of the finite-state machines that generate new sequences. 2(a): A portion of the finite-state transducer that computes the list of features contained in a sequence, showing paths that can be taken from the state that represents a trigram history of CW. In one path the machine accepts the amino acid V as input, and emits the features f3, f6, and f8, before proceeding to the state that indicates that the history is now WV. On the other path the machine accepts the amino acid R, and emits the features f5, f7, and f9. The symbol ε represents the empty string. 2(b): A finite-state acceptor that assigns scores to features.
Training an SVM produces a linear classifier in the feature space, defined by

\[ w^{T}\phi(x) + b, \]

where $\phi(x)$ is the mapping of a sequence to the feature space and $w$ is a weight vector that describes the decision boundary hyperplane in the feature space. It has been shown that the distance from a point in the feature space to the separating hyperplane can be mapped by a sigmoid function to an estimate of the probability that the point will belong to the positive or negative class [18]. Therefore, from our trained SVM we extract $w$, which indicates the direction in the feature space along which we hypothesize that sequences are more likely to be positive examples.

3.4. Building a WFST to Generate New Sequences. A weighted finite-state transducer is an automaton in which each transition between states is associated with an input symbol, an output symbol, and a weight.
Formally, WFSTs are defined as 8-tuples $(\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$, where $\Sigma$ is an input alphabet, $\Delta$ is an output alphabet, $Q$ is a finite set of states, $I$ is the set of initial states, $F$ is the set of final states, $E$ is the set of transitions between states, $\lambda$ is a weight function for initial states, and $\rho$ is a weight function for final states. The input and output alphabets are augmented with a symbol $\epsilon$ which represents the empty string, allowing transitions to not input or output a symbol. If no outputs are associated with the transitions then the machine is referred to as a weighted finite-state acceptor (WFSA). Examples of WFSTs and WFSAs can be seen in Figure 2.

An important operation on WFSTs is composition, denoted by $\circ$. In composition, the outputs of one WFST are fed to the inputs of a second WFST or WFSA. Efficient implementations of composition allow the construction of complex models based on a set of simple machines [19].

We use WFSTs to generate novel peptide sequences that will score well according to the weight vector learned by our classifier. To do so, we compose together three finite-state machines to build a WFST capable of generating our desired sequences. Our first machine, $F$, is an unweighted transducer that maps from a sequence of amino acids to a sequence of tokens representing features, as shown in Figure 2(a). Our second machine, $S$, is a WFSA that provides a score for a given sequence of features. This machine is built using the weight vector from the SVM classifier (Figure 2(b)). Each arc accepts a single feature token $f_i$ with a weight equal to $\alpha w_i$, where $w_i$ is the value of the classifier weight vector for that feature and $\alpha$ is a parameter set when building the model. After applying a weight pushing algorithm and normalizing, the weights within the machine are treated as log probabilities; therefore, the parameter $\alpha$ can be used to vary the peakedness of the probability distribution over generated sequences, because $e^{\alpha w_i}$ is a factor in the probability of a sequence. The third machine, $T$, is a simple transducer that accepts and outputs any sequence of length $m$; this machine constrains the length of our generated sequences. We use a value for $m$ of 12 for the inorganic binding peptide problem, since 12 is a standard length for designed inorganic binding peptides. For AMPs, we set $m$ to 30, the median length of the AMP positive training set. We build our final machine by composing the three sub-machines:
\[ T \circ F \circ S \]

We then determinize the paths through this transducer and normalize the scores of each path. This produces a WFST that accepts amino acid sequences of length $m$ with a score given by summing up the individual weights of every feature contained within the sequence, multiplied by the peakedness parameter $\alpha$. We build our finite-state machines using the open source OpenFST library, version 1.1 [20]. In addition to implementing the algorithms needed to produce the transducer as described above, OpenFST provides tools that can search for the highest scoring sequences accepted by the machine, and can sample from high-scoring sequences probabilistically, by treating the scores of each transition within the machine as a negative log probability. Random sampling adds diversity to our results, which is desirable because the highest scoring sequences generated by the model are often permutations of the same motifs.
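The search for the highest scoring sequence of a fixed length can be illustrated without a finite-state toolkit. The sketch below is a plain dynamic program over n-gram histories, not the OpenFST composition used in our system; for brevity it scores n-grams of residues rather than of residue classes, and the alphabet and the weights in `TOY_WEIGHTS` are hypothetical. A brute-force enumeration is included only as a sanity check on small inputs.

```python
from itertools import product
from typing import Dict, Tuple

def sequence_score(seq: str, weights: Dict[str, float],
                   alpha: float = 1.0, max_n: int = 3) -> float:
    """alpha times the summed weights of every n-gram occurrence in seq."""
    total = 0.0
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            total += weights.get(seq[i:i + n], 0.0)
    return alpha * total

def best_sequence(alphabet: str, weights: Dict[str, float], m: int,
                  alpha: float = 1.0, max_n: int = 3) -> Tuple[float, str]:
    """Highest scoring length-m sequence, found by dynamic programming.

    The state is the last (max_n - 1) symbols, which plays the role of the
    history states in Figure 2(a).
    """
    best: Dict[str, Tuple[float, str]] = {"": (0.0, "")}
    for _ in range(m):
        nxt: Dict[str, Tuple[float, str]] = {}
        for hist, (score, seq) in best.items():
            for a in alphabet:
                ctx = hist + a
                # Weights of all n-grams that end at the newly added symbol.
                gain = sum(weights.get(ctx[-n:], 0.0)
                           for n in range(1, min(max_n, len(ctx)) + 1))
                new_hist = ctx[-(max_n - 1):] if max_n > 1 else ""
                cand = (score + alpha * gain, seq + a)
                if new_hist not in nxt or cand[0] > nxt[new_hist][0]:
                    nxt[new_hist] = cand
        best = nxt
    return max(best.values())

def brute_force_best(alphabet: str, weights: Dict[str, float], m: int,
                     alpha: float = 1.0) -> Tuple[float, str]:
    """Exhaustive check; only feasible for tiny alphabets and lengths."""
    return max((sequence_score("".join(s), weights, alpha), "".join(s))
               for s in product(alphabet, repeat=m))

# Hypothetical weights over a three-letter alphabet.
TOY_WEIGHTS = {"W": 0.5, "GW": 0.3, "WGW": 1.0, "A": -0.2, "WW": -0.7}
score, seq = best_sequence("AGW", TOY_WEIGHTS, m=5)
print(seq, round(score, 3))
```

Because the gain from adding a residue depends only on the previous two residues, the number of states stays fixed as the length grows, which is why a lattice search remains tractable where enumeration does not. Passing a negative `alpha` inverts the weights and returns the lowest scoring sequence under the original weights.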
3.5. Addition of a Language Model for Inorganic Binding Peptides. Because of the small size of the inorganic binding peptide training set, we wanted to ensure that our model did not overfit and could generalize to produce novel chemically and biologically plausible sequences. To do so, we build a machine $L$, which models the probabilities of peptide sequences based on published sequences that are known to have some binding function. This machine is a finite-state language model for amino acid sequences, of the type commonly used in speech recognition and language processing [15]. We create a training set for $L$ by downloading all protein sequences from the Gene Ontology database [21] that are annotated with the term "binding". Our current language model training set was downloaded from the 2010-01-10 seqdb release of the Gene Ontology database, which contains 187,269 sequences that are directly or indirectly associated with the term "binding". By extracting uniquely named sequences from this release, we produced a training set of 64,894 unique protein sequences that we used to train our model.

To build the language model we used a set of functions built on the OpenFST toolkit. The language model was built using an n-gram order of 4, so that the probability of an amino acid appearing in the sequence is conditional on the previous three amino acids. We use Witten-Bell smoothing to estimate the probabilities of sequences of amino acids that do not appear in the training data. Finally, we rescale the language model probabilities using an additional parameter $\beta$, by which we multiply each transition weight in $L$. Much like the peakedness parameter $\alpha$ defined above, $\beta$ controls the peakedness of the probability distribution defined by the language model; we use it to increase or decrease the effect of the language model on the sequences generated by our system. In this case, we build our complete machine by composing the four sub-machines:
\[ L \circ T \circ F \circ S \]

This produces a weighted finite-state acceptor that accepts sequences of 12 amino acids with a score given by summing up the individual weights of every feature contained within the sequence, multiplied by the peakedness parameter $\alpha$. The probability of generating a sequence is affected by both the score of the sequence in the SVM classifier and the probability of the sequence according to the language model.
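Witten-Bell smoothing and the rescaling of transition weights can be sketched on a much smaller model than the one described in this section. The class below is an assumption-laden illustration: it uses a bigram model instead of order 4, the interpolated form of Witten-Bell smoothing with a uniform distribution as the final fallback, and plain Python rather than the OpenFST-based functions used in our system. The name `WittenBellBigram` and the parameter name `beta` are introduced here for the example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

class WittenBellBigram:
    """Bigram residue model with interpolated Witten-Bell smoothing.

    Assumes a non-empty training set containing only letters in ALPHABET.
    """

    def __init__(self, sequences: Iterable[str]) -> None:
        self.unigram: Counter = Counter()
        self.bigram: Dict[str, Counter] = defaultdict(Counter)
        for seq in sequences:
            self.unigram.update(seq)
            for prev, cur in zip(seq, seq[1:]):
                self.bigram[prev][cur] += 1
        self.total = sum(self.unigram.values())

    def prob_unigram(self, aa: str) -> float:
        # Interpolate the observed frequency with a uniform distribution.
        types = len(self.unigram)
        lam = self.total / (self.total + types)
        return lam * self.unigram[aa] / self.total + (1 - lam) / len(ALPHABET)

    def prob(self, aa: str, prev: str) -> float:
        followers = self.bigram.get(prev)
        if not followers:  # unseen history: back off entirely
            return self.prob_unigram(aa)
        count = sum(followers.values())   # tokens seen after prev
        types = len(followers)            # distinct residues seen after prev
        lam = count / (count + types)     # Witten-Bell interpolation weight
        return lam * followers[aa] / count + (1 - lam) * self.prob_unigram(aa)

    def arc_weight(self, aa: str, prev: str, beta: float = 1.0) -> float:
        """Negative log probability, rescaled by the strength parameter."""
        return -beta * math.log(self.prob(aa, prev))

lm = WittenBellBigram(["ACDA", "ACCA"])
print(lm.prob("C", "A"), lm.prob("W", "A"))  # seen vs. unseen continuation
```

Unseen continuations receive a small but nonzero probability, so every sequence remains reachable in the composed machine, and a larger `beta` increases the penalty on improbable transitions.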
3.6. Parametrization of the Model and Generation of New Sequences. Because of the different validation methods available for the problems of inorganic binding peptide and AMP design, we used different approaches to set parameters and generate sequences for testing. For inorganic binding peptides, we had a very small training set and had to select a small number of designed sequences for experimental testing; therefore, we first chose the parameters for our model based on observations of the sequences generated by a variety of parameter combinations. We then selected the highest scoring candidate peptides from a model built with the best combination of parameters. The existence of third-party computational prediction servers for AMPs, on the other hand, allowed us to generate and test large batches of sequences simultaneously, albeit without the certainty of experimental verification. Therefore, we generated large groups of sequences for several models, as well as groups of control sequences for comparison.
3.6.1. Selection of Inorganic Binding Peptide Sequences. In addition to the language model described in Section 3.5, we had two additional strategies for helping our model to generalize even with the small training set of inorganic binding peptides. First, we chose parameters which maximized the diversity of sequences generated as well as their scores within the model, so that we would be more likely to have a successful result among our selected set. Additionally, we used a distance constraint to filter out sequences that were too different from the known positive examples.

To choose parameters, we sampled sets of 2000 sequences for each value of the model peakedness parameter $\alpha$ in [1, 2, 3, 5, 6, 7, 8, 10, 11], and the language model strength parameter $\beta$ in [0.75, 1, 1.5, 2, 3]. For each combination of parameters, we computed the average score for all 2000 sequences in the SVM classifier, as well as the average score given by the QUARTZ2 matrix designed by Oren et al. [6]. Greater values of $\alpha$ produce sequences with higher scores in both scoring systems, as do smaller values of $\beta$. However, the diversity of the sequences created decreases in both cases. Based on a subjective analysis of the scores and diversity of the sequences created, we chose a value for $\alpha$ of 3 and a value for $\beta$ of 0.75; this set of parameters produced consistently high scoring sequences while maintaining population diversity in the generated set. POPDIV [22] is an algorithm for computing the population diversity of a sample of peptide sequences. Our final parameter choice yielded a set of sequences with high scores but significant population diversity as measured by POPDIV; with larger values of $\alpha$, population diversity drops quickly to insignificant levels.

To select our final candidates for testing, we added a final filtering step to our process that ensured that our candidate sequences were similar to the positive training examples. We first generated 100,000 random sequences. For each sequence, we calculated the Euclidean distance between it and the centroid of the set of positive training examples in the feature space. In other words, we calculated the distance $d$ for a candidate sequence $x$ as:
\[ d(x) = \left\lVert \phi(x) - \frac{\sum_{i=1}^{|P|} \phi(P_i)}{|P|} \right\rVert_2 \]
where $P$ is the set of positive training examples and $\phi$ is the mapping from a sequence to the n-gram feature space. By discarding sequences for which $d(x) > \delta$ for some threshold $\delta$, we constrained our search to find sequences that lie within a region centered on the average positive training example. Based on comparisons of the scores and diversities of sequences at various distances, we chose a value for $\delta$ of 40. We discarded all sequences with distances greater than 40, and then chose the 50 remaining sequences with the highest scores in our SVM classifier as candidates for experimental validation.

To further test the properties of our system, we also generated sequences with the peakedness parameter $\alpha$ set to -1. This inverts the weights and should produce peptide sequences that have little to no inorganic binding affinity. To produce this set of predicted weak binders, we chose the 20 lowest scoring sequences from a sample of 100,000. We also included the lowest possible scoring sequence in the model, found using a best path search of the WFST: AIRGIRGIRGIR.
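The distance filter described above can be sketched in a few lines. The sketch represents feature vectors as sparse dictionaries of counts, which is an implementation choice made for this example; the function names and the threshold argument `delta` are likewise illustrative.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # sparse n-gram feature counts

def centroid(vectors: List[Vector]) -> Vector:
    """Mean feature vector of a non-empty list of sparse vectors."""
    keys = set().union(*vectors)
    n = len(vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / n for k in keys}

def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance, treating missing features as zero counts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def filter_by_distance(candidates: List[Vector], positives: List[Vector],
                       delta: float) -> List[Vector]:
    """Keep candidates whose distance to the positive centroid is at most delta."""
    center = centroid(positives)
    return [x for x in candidates if euclidean(x, center) <= delta]

# Toy feature vectors with two hypothetical features "a" and "b".
positives = [{"a": 2.0}, {"b": 2.0}]          # centroid is a=1, b=1
candidates = [{"a": 4.0, "b": 5.0}, {"a": 1.0, "b": 1.0}]
kept = filter_by_distance(candidates, positives, delta=4.9)
print(kept)
```

After filtering, the remaining candidates would be ranked by their classifier score and the top ones selected, as described in the text.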
3.6.2. Creation of Sets of Antimicrobial Peptide Sequences. We created five sets of novel AMP sequences for testing, each of which consisted of 2,000 novel peptide sequences of length 30. The first, RAND, was a control set, consisting of amino acid residues chosen randomly at each position. As a second control, we also created a set AADIST, which was generated by setting each amino acid independently based on the distribution of amino acids in the CAMP database. We then sampled from our WFST to build three sets WFST1, WFST2, and WFSTNEG, with the peakedness parameter $\alpha$ set to 1, 2, and -1, respectively. The group WFSTNEG exists to demonstrate the ability of the method to solve the inverse design problem: creating sequences which are unlikely to be AMPs. For each of our WFST-generated groups, we verified that no sequences shared more than 0.9 sequence identity using the CD-HIT clustering web service [16].
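The two control sets can be sketched as follows. The function names and the fixed random seed are assumptions for the example, and the residue distribution here is estimated from a tiny made-up list rather than from the CAMP database.

```python
import random
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def rand_set(n: int, length: int, rng: random.Random) -> List[str]:
    """RAND-style control: each position drawn uniformly at random."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
            for _ in range(n)]

def residue_distribution(sequences: List[str]) -> Dict[str, float]:
    """Relative frequency of each standard residue in a set of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts[a] for a in AMINO_ACIDS)
    return {a: counts[a] / total for a in AMINO_ACIDS}

def aadist_set(n: int, length: int, dist: Dict[str, float],
               rng: random.Random) -> List[str]:
    """AADIST-style control: positions drawn independently from dist."""
    weights = [dist[a] for a in AMINO_ACIDS]
    return ["".join(rng.choices(AMINO_ACIDS, weights=weights, k=length))
            for _ in range(n)]

rng = random.Random(0)
known = ["GIGKFLHSAKKF", "KWKLFKKIEK"]      # made-up stand-ins for database entries
dist = residue_distribution(known)
rand_group = rand_set(2000, 30, rng)
aadist_group = aadist_set(2000, 30, dist, rng)
print(rand_group[0], aadist_group[0])
```

The AADIST-style set preserves residue composition but no ordering information, which is what makes it a useful baseline against sequences generated from n-gram features.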
3.7. Experimental Validation of Inorganic Binding Peptides. From our final set of candidate inorganic binding peptides, our collaborators in the GEMSEC group at the University of Washington have selected several peptides for synthesis and experimental validation. Their choices are based on factors such as difficulty of synthesis and expected yield, properties of sequences that could not be learned computationally from the data available.

3.8. Third-party Prediction Servers for AMP Sequences. To evaluate the novel AMP sequences produced by our system, we used the prediction server provided by the CAMP database website [5]. Although it is not a substitute for experimental validation, using a computational prediction technique allows rapid testing of large sets of peptides, and is therefore useful in validating our approach. The CAMP server classifies peptide sequences as AMPs using three methods: support vector machines (SVM), random forests (RF), and discriminant analysis (DA). Thomas et al. [5] showed that these classifiers perform well on a test data set, with overall accuracies of 91.5%, 93.2%, and 87.5%, respectively. The CAMP predictors use a feature set composed of a variety of features including amino acid composition, average hydrophobicity and hydrophilicity of the peptide, transition and composition of groups based on reduced amino acid alphabets, and di- and tri-peptide composition based on hydrophobicity. Therefore, there is partial but not complete overlap with the feature set used in our SVM classifier and WFSTs. Even though this overlap may make it easier for our method to produce sequences that CAMP classifies as AMPs, we believe that the construction of sequences that score highly in a third-party classification system is a valuable demonstration of our approach.

4. RESULTS

4.1. Performance of SVM Classifiers. We began evaluating our system by testing the SVM classifiers we trained for each peptide design problem on a held-out test set. Because we used the feature weights from the SVM to build our WFSTs, our system depends on the SVM's ability to correctly classify unseen sequences as positive or negative examples.
[Figure 3: scatter plot of Dip position shift (nm) against Distance from hyperplane, with series Predicted Strong Binders and Predicted Weak Binders.]

Figure 3. Experimental binding scores of 10 test sequences from Oren et al. [6]. The sequences are arranged by their distance from the hyperplane in the SVM classifier. The Y axis shows the Dip position shift after six minutes as measured by Oren et al., a measurement of binding ability. Sequences that Oren et al. predicted to be strong binders are light lines; predicted weak binders are dark.
4.1.1. Inorganic Binding Peptides. When trained on the 39 training examples from Oren et al. [6], our classifier successfully predicts strong and weak binders when evaluated against the 10 novel peptides reported in that paper (Figure 3). In addition, the distance from the hyperplane explains a portion of the variance in the observed binding ability (R² = 0.52, p = 0.02). Although this is a small data set, the result is encouraging.

4.1.2. Antimicrobial Peptides. We evaluated the SVM classifier from the AMP model against our held-out set of 259 positive and 367 negative examples. Our classifier had an area under the ROC curve of 0.931. This indicates that our feature set of n-gram counts of amino acid classes provides sufficient information to train an SVM classifier with strong performance in the AMP problem domain.

Figure 4. Number of sequences in a sample of 2000 predicted to be AMPs for different sequence generation methods. The Y axis shows the number of positive predictions: a value of 2000 indicates that all of the generated sequences were predicted to be AMPs; a value of 0 indicates that none were predicted to be AMPs. RAND: Sequences generated completely randomly. AADIST: Sequences generated according to the distribution of amino acids in the CAMP database. WFST1: Sequences generated using a WFST with a peakedness parameter of 1. WFST2: Sequences generated using a WFST with a peakedness parameter of 2. WFSTNEG: Sequences generated using a WFST with a peakedness parameter of -1. Results are shown for all three prediction methods available at the CAMP database server. SVM: Support Vector Machine. RF: Random Forest. DA: Discriminant Analysis.
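The area under the ROC curve reported in Section 4.1.2 can be read as the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one. The following minimal sketch computes it by direct pairwise comparison. The function name `roc_auc` and the score lists are illustrative assumptions and are not taken from our data or implementation.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve as a pairwise ranking statistic.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical SVM decision values (signed distances from the hyperplane).
pos = [1.2, 0.4, 0.9, -0.1]
neg = [-0.8, 0.5, -1.3, -0.2]

auc = roc_auc(pos, neg)
print(f"AUC = {auc:.3f}")
```

The quadratic loop is adequate for a held-out set of a few hundred sequences; a sort-based rank computation would be preferable for much larger sets.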
4.2. Validation of Generated Sequences. Our inorganic binding sequences are being tested experimentally by the GEMSEC group at the University of Washington at the time of this writing. However, we have been able to test our AMP sequences using computational prediction servers.
Feature                         Weight
Tryptophan                       0.027
Buried-Buried-Cyclic             0.022
Buried-Cyclic-Aromatic           0.019
Medium-Aromatic-Aliphatic        0.019
Aliphatic-Medium-Aromatic        0.018
Hydrophobic-Polar-Medium        -0.018
Polar-Small-Buried              -0.020
Hydrophobic-Medium-Aliphatic    -0.021

Table 2. Highest and lowest scoring features in the inorganic binding model
4.2.1. Antimicrobial Peptides. We used the CAMP prediction servers to determine the number of sequences predicted to be AMPs in our control and test groups. The results are shown in Figure 4. Averaged across the three prediction methods, only 3.9% of the random control group (RAND) were predicted to be AMPs, while 59.6% of the AADIST control set had positive predictions. In the WFST1 group, an average of 84.9% were predicted to be AMPs. In the group created with a more peaked probability distribution over sequences, WFST2, an average of 99.9% of the generated sequences were predicted to be AMPs. Finally, in the WFSTNEG group, in which the sign of the weights was reversed to reward non-AMP features, only 0.48% of the generated sequences were predicted to be AMPs.

4.3. Examining Highly Weighted Features in the Models. To better understand the predictions made by our model, we extracted the features with the highest and lowest weights for both peptide design domains.

4.3.1. Inorganic Binding Peptides. Table 2 shows the five highest and three lowest weighted features in the inorganic binding peptide model. The importance of tryptophan in our model agrees with the results of Oren et al. [6]. An analysis of the trigrams with strong weights in the model may help us to better understand the chemical structures necessary for inorganic binding.
Feature                     Weight
Cysteine                     0.428
Isoleucine                   0.279
Lysine                       0.246
Aromatic-Buried-Small        0.169
Medium-Medium-Aromatic       0.159
Threonine                   -0.323
Leucine                     -0.356
Serine                      -0.380

Table 3. Highest and lowest scoring features in the AMP model
4.3.2. Antimicrobial Peptides. The five highest and three lowest weighted features for the AMP model are shown in Table 3. As with the inorganic binding peptides, our model identified unigram features of amino acid residues that are overrepresented in AMP sequences; for example, the importance of cysteine residues in AMPs has been widely studied [23, 24]. Analysis of the trigram and bigram features identified by the model may yield insights into AMP structure and design.
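The selection of features reported in Tables 2 and 3 amounts to ranking the weights of a linear model and taking both ends of the ranking. A minimal sketch follows, using the weights from Table 3. The function name `extreme_features` and the two near-zero entries marked as hypothetical are illustrative assumptions, added only so that the ranking has something to discard.

```python
def extreme_features(weights, n_top=5, n_bottom=3):
    """Return the n_top highest and n_bottom lowest weighted features.

    `weights` maps a feature name to its weight in a linear classifier.
    Both returned lists are ordered from highest to lowest weight.
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n_top], ranked[-n_bottom:]


amp_weights = {
    # Values copied from Table 3.
    "Cysteine": 0.428,
    "Isoleucine": 0.279,
    "Lysine": 0.246,
    "Aromatic-Buried-Small": 0.169,
    "Medium-Medium-Aromatic": 0.159,
    "Threonine": -0.323,
    "Leucine": -0.356,
    "Serine": -0.380,
    # Hypothetical low-magnitude features, not from the model.
    "hypothetical-feature-a": 0.010,
    "hypothetical-feature-b": -0.020,
}

top, bottom = extreme_features(amp_weights)
for name, weight in top + bottom:
    print(f"{name:26s} {weight:+.3f}")
```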
5. DISCUSSION

5.1. Conclusions. We have shown that by using the n-gram features of chemical classes of amino acid residues and a trained SVM classifier, we can produce WFSTs that are capable of generating novel sequences which share the same features as the training set. We have applied our solution to two problems in peptide design. The SVM classifier component of our system performs well on known data sets of positive and negative examples in both problem domains. We have produced a set of inorganic binding peptide sequences that we are testing experimentally. We have also generated sets of novel AMP sequences, and a third-party classification server predicts that a large proportion of these novel sequences will have antimicrobial properties. By varying the parameters used to construct our machines, we can exchange diversity of the generated sequences for a higher likelihood of generating novel peptides in a desired class. We believe that this framework is a promising approach for novel peptide design. By applying our system to both inorganic binding and AMP sequence design, we have shown that it may be generalizable to many problems in peptide design.
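The trade-off between diversity and class likelihood can be illustrated with a toy distribution over a handful of candidates. The sketch below assumes that a peakedness parameter multiplies each candidate's classifier score before exponentiation and normalization. This functional form, the function names, and the scores are illustrative assumptions and should not be read as the construction used to build the WFSTs.

```python
import math


def sampling_distribution(scores, peakedness):
    """Normalize exp(peakedness * score) over all candidates (assumed form)."""
    scaled = [peakedness * s for s in scores]
    offset = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - offset) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower entropy means less diverse samples."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical classifier scores for four candidate sequences.
scores = [2.0, 1.0, 0.0, -1.0]

p1 = sampling_distribution(scores, 1.0)     # baseline
p2 = sampling_distribution(scores, 2.0)     # more peaked
pneg = sampling_distribution(scores, -1.0)  # preference reversed

for label, probs in (("1", p1), ("2", p2), ("-1", pneg)):
    print(label, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Under this assumption, a larger parameter concentrates probability on the highest-scoring candidates and lowers the entropy of the samples, while a negative parameter shifts the mass toward the lowest-scoring candidates.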
5.2. Future Work. Although we believe that the results from third-party computational prediction servers for AMPs show that this is a promising approach, any attempt to computationally design novel AMPs must eventually be validated by synthesizing and testing actual peptides. We plan to do so in future work.

We believe that this framework is generally applicable to many problems in peptide design. We originally designed the system for the inorganic binding peptide design problem, but found that we could adapt it to the design of AMPs with minimal changes. Therefore, we are optimistic that it could be generalized to other applications. One such problem is the design of peptides capable of binding to larger proteins, such as G protein-coupled receptors, an application with many implications in drug design.

We would also like to study the impact of different feature mappings for sequences on the performance of the system. Our feature mapping is closely related to string kernel methods such as the spectrum kernel [13] and the mismatch kernel [25], and could be extended to incorporate features of other kernels, such as the gappy and wildcard kernels [26]. In fact, because our approach relates string kernels and weighted finite-state transducers, it should be possible to incorporate any rational kernel [27], a class of kernels defined in terms of WFSTs that includes most string kernels currently in use in computational biology.
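To make the relationship to string kernels concrete, the following minimal sketch implements the feature map and inner product of the spectrum kernel [13] over raw residue strings. It operates directly on amino acid letters rather than on chemical classes, since the class mapping is not reproduced in this section. The function names and the two example sequences are hypothetical.

```python
from collections import Counter


def spectrum_features(sequence, k):
    """Count every contiguous k-mer (n-gram of length k) in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k):
    """Inner product of the two k-mer count vectors."""
    features_a = spectrum_features(seq_a, k)
    features_b = spectrum_features(seq_b, k)
    # A Counter returns 0 for k-mers that are absent from seq_b.
    return sum(count * features_b[kmer] for kmer, count in features_a.items())


# Hypothetical peptide sequences in one-letter amino acid code.
seq_a = "KWKLFKKI"
seq_b = "KLFKKW"

print(spectrum_features(seq_a, 3))
print(spectrum_kernel(seq_a, seq_b, 3))
```

Replacing each residue with a class label before counting would give n-gram counts over classes of the kind used as features in this work; the gappy and wildcard kernels [26] relax the exact-match requirement on each k-mer.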
6. ACKNOWLEDGEMENTS

Kemal Sönmez and Brian Roark provided the initial ideas for this project, and contributed invaluable advice during the course of its implementation. Kemal Sönmez provided the initial implementation of the peptide sequence to feature mapping code. Brian Roark provided tools for building the UniProt language model. I am thankful for the willingness of Mehmet Sarikaya, Candan Tamerler, and the GEMSEC group at the University of Washington to test the novel inorganic binding sequences. I also thank the NSF for supporting part of this work through Grant #IIS-0811745.

References
[1] C. Tamerler and M. Sarikaya, Molecular biomimetics: nanotechnology and bionanotechnology using genetically engineered peptides, Philos Transact A Math Phys Eng Sci, vol. 367, pp. 1705-26, May 2009.
[2] R. H. Hoess, Protein design and phage display, Chem Rev, vol. 101, pp. 3205-18, Oct 2001.
[3] H. Jenssen, P. Hamill, and R. E. W. Hancock, Peptide antimicrobial agents, Clin Microbiol Rev, vol. 19, pp. 491-511, Jul 2006.
[4] G. Wang, X. Li, and Z. Wang, APD2: the updated antimicrobial peptide database and its application in peptide design, Nucleic Acids Research, vol. 37, no. Database issue, p. D933, 2009.
[5] S. Thomas, S. Karnik, R. S. Barai, V. K. Jayaraman, and S. Idicula-Thomas, CAMP: a useful resource for research on antimicrobial peptides, Nucleic Acids Research, vol. 38, pp. D774-80, Jan 2010.
[6] E. E. Oren, C. Tamerler, D. Sahin, M. Hnilova, U. O. S. Seker, M. Sarikaya, and R. Samudrala, A novel knowledge-based approach to design inorganic-binding peptides, Bioinformatics, vol. 23, pp. 2816-22, Nov 2007.
[7] A. Cherkasov, K. Hilpert, H. Jenssen, C. Fjell, M. Waldbrook, S. Mullaly, R. Volkmer, and R. Hancock, Use of artificial intelligence in the design of small peptide antibiotics effective against a broad spectrum of highly antibiotic-resistant superbugs, ACS Chem Biol, vol. 4, no. 1, pp. 65-74, 2008.
[8] H. Jenssen, C. D. Fjell, A. Cherkasov, and R. E. W. Hancock, QSAR modeling and computer-aided design of antimicrobial peptides, J Pept Sci, vol. 14, pp. 110-4, Jan 2008.
[9] C. Loose, K. Jensen, I. Rigoutsos, and G. Stephanopoulos, A linguistic model for the rational design of antimicrobial peptides, Nature, vol. 443, pp. 867-9, Oct 2006.
[10] M. Socolich, S. Lockless, W. Russ, and H. Lee, Evolutionary information for specifying a protein fold, Nature, vol. 437, pp. 512-8, Jan 2005.
[11] J. Thomas, N. Ramakrishnan, and C. Bailey-Kellogg, Protein design by sampling an undirected graphical model of residue constraints, IEEE/ACM Trans Comput Biol Bioinform, vol. 6, pp. 506-16, Jan 2009.
[12] D. Anastassiou, Frequency-domain analysis of biomolecular sequences, Bioinformatics, vol. 16, pp. 1073-81, Dec 2000.
[13] C. Leslie, E. Eskin, and W. S. Noble, The spectrum kernel: a string kernel for SVM protein classification, Pac Symp Biocomput, pp. 564-75, Jan 2002.
[14] B. Roark and R. Sproat, Computational Approaches to Morphology and Syntax. Oxford Surveys in Syntax & Morphology, Oxford: Oxford University Press, 2007.
[15] M. Mohri, F. Pereira, and M. Riley, Weighted finite-state transducers in speech recognition, Comput Speech Lang, vol. 16, pp. 69-88, Jan 2002.
[16] Y. Huang, B. Niu, Y. Gao, L. Fu, and W. Li, CD-HIT suite: a web server for clustering and comparing biological sequences, Bioinformatics, vol. 26, pp. 680-2, Mar 2010.
[17] T. Joachims, Making large-scale SVM learning practical, in Advances in Kernel Methods - Support Vector Learning (B. Schölkopf, C. Burges, and A. Smola, eds.), ch. 11, Cambridge, MA: MIT Press, 1999.
[18] J. C. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in Advances in Large Margin Classifiers, pp. 61-74, MIT Press, 1999.
[19] F. C. N. Pereira and M. Riley, Finite-State Devices for Natural Language Processing. Cambridge, Massachusetts: MIT Press, 1997.
[20] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, OpenFst: A general and efficient weighted finite-state transducer library, in Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA 2007), vol. 4783 of Lecture Notes in Computer Science, pp. 11-23, Springer, 2007. http://www.openfst.org.
[21] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nat Genet, vol. 25, pp. 25-9, May 2000.
[22] L. Makowski and A. Soares, Estimating the diversity of peptide populations from limited sequence data, Bioinformatics, vol. 19, pp. 483-9, Mar 2003.
[23] J. Dimarcq, P. Bulet, C. Hetru, and J. Hoffmann, Cysteine-rich antimicrobial peptides in invertebrates, Peptide Science, vol. 47, no. 6, pp. 465-477, 1999.
[24] T. Ganz, Defensins: antimicrobial peptides of innate immunity, Nat Rev Immunol, vol. 3, pp. 710-20, Sep 2003.
[25] C. Leslie, E. Eskin, J. Weston, and W. Noble, Mismatch string kernels for SVM protein classification, Advances in Neural Information Processing Systems, pp. 1441-1448, 2003.
[26] C. Leslie and R. Kuang, Fast string kernels using inexact matching for protein sequences, The Journal of Machine Learning Research, vol. 5, Dec 2004.
[27] C. Cortes, P. Haffner, and M. Mohri, Rational kernels: Theory and algorithms, The Journal of Machine Learning Research, vol. 5, pp. 1035-1062, Aug 2004.
