Peter D. Karp,
SRI International, Menlo Park, California
doi: 10.1002/9780470048672.wecb321
Advanced Article
Article Contents: Introduction; Representing Metabolic Knowledge in PGDBs; Metabolic Databases; Computing with the Metabolism of a PGDB
The field of pathway bioinformatics is concerned with the representation and the manipulation of metabolic information within computers. By organizing genome information into pathways, pathway bioinformatics places genes and their products into a mechanistic framework. This article describes how metabolic pathways are represented in a computer, and it describes the BioCyc (SRI International, Menlo Park, CA) collection of pathway/genome databases for several hundred organisms. Each BioCyc database describes the genome and the metabolic network of a single organism. This article also describes algorithms for computing with pathway data. Pathway visualization algorithms help scientists comprehend this complex information space and facilitate analysis of large-scale omics datasets. Pathway analysis algorithms predict the metabolic network of an organism from its genome, identify the genes coding for missing enzymes in metabolic pathways, and enable the comparison of metabolic networks from multiple organisms. They also allow the prediction of the metabolic capabilities of an organism and identify potential drug targets within the metabolic network.
The field of pathway bioinformatics is concerned with a range of problems related to the representation and the manipulation of metabolic information within computers. How do we accurately capture our knowledge of metabolic pathways and enzymes within the computer? How do we construct databases of metabolic information? How do we predict the metabolic pathways of an organism from its sequenced genome, and how do we compare the metabolic networks of two organisms?
Introduction
The bioinformatics subfield of pathway bioinformatics is concerned with developing computer representations of the metabolic network of an organism, with developing databases of metabolic information, and with developing algorithms for computing with metabolic information. This article will discuss the approaches for each of these problems in the Pathway Tools and BioCyc (SRI International, Menlo Park, CA) projects. Pathway Tools is a set of algorithms for computational analysis of metabolic data, and it includes computer representations of the metabolic network (1). BioCyc is a collection of Pathway/Genome Databases (PGDBs) for several hundred organisms (2). The BioCyc databases were constructed using Pathway Tools, and they can be queried and analyzed using Pathway Tools.
Together, Pathway Tools and BioCyc permit extremely fast and accurate modeling of the metabolic network of an organism from its genome sequence. Previously, hundreds of person-years of laboratory work were required to characterize an organism's metabolic map. Now, given an annotated genome sequence for the organism, its metabolic network can be predicted computationally within a few days. Manual review of that computational prediction will yield a more accurate result within a few weeks. Computational reconstructions of metabolism are not perfectly accurate; thus, for increased model fidelity, we recommend following the computational reconstruction with manual curation of the metabolic network. A manual curation effort surveys the past biochemical literature for an organism, tracks newly emerging literature on an ongoing basis, and updates the metabolic model within a PGDB to reflect those findings.
WILEY ENCYCLOPEDIA OF CHEMICAL BIOLOGY 2008, John Wiley & Sons, Inc.
pathways that describe small, functionally linked subsets of reactions. Which approach is preferred? The answer is that both approaches have value, and they are not mutually exclusive; therefore, Pathway Tools supports both views of metabolism in a PGDB.
Pathway Tools conceptualizes the metabolic network in three layers. The first layer consists of the small-molecule substrates on which the metabolism operates. The second layer consists of the reactions that interconvert the small-molecule metabolites. The third layer is the metabolic pathways in which the components are the metabolic reactions of the second layer. Note that not all reactions in the second layer are included in pathways in the third layer because some metabolic reactions are not assigned to any metabolic pathway. Scientists who choose to view the metabolic network solely as a reaction list can operate on the second layer directly without interference from the third layer. But for a scientist for whom the pathway definitions are important, the pathway layer is available.
The pathways in PGDBs are modules of the metabolic network of a single organism. Often, they are conserved across many species. These pathways are regulated as a unit (based on substrate-level regulation of enzymes, on regulation of gene expression, and on other types of regulation), and their boundaries are defined at high-connectivity, stable metabolites (3). PGDB pathways are defined based on pathways published in the experimental literature. More precisely, the compounds, reactions, and pathways in levels 1–3 are each represented as distinct database objects within a PGDB. That is, separate PGDB objects encode each metabolite, each metabolic reaction, and each metabolic pathway.
[Figure 1 diagram: pathway Trypsyn-Pwy (tryptophan biosynthesis) is linked to reaction Rxn0-2382 (indole + L-serine → L-tryptophan + H2O) by the fields reaction-list and in-pathway; the reaction is linked to Ser (L-serine) by left and appears-in-leftside-of, and to Trp (L-tryptophan) by right and appears-in-rightside-of.]
Figure 1 Relationships that link levels 1–3 of the metabolic network. The metabolite tryptophan (ID TRP) is a reactant of the reaction whose ID is RXN0-2382, which in turn is a member of the pathway whose ID is TRYPSYN-PWY. The field IN-PATHWAY is the inverse of the field REACTION-LIST.
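The relationships in Figure 1 can be illustrated with a small Python sketch. This is not the Pathway Tools schema or API; the class names, the `ToyPGDB` container, the IDs `INDOLE` and `WATER`, and the reaction `RXN-TOY-1` are assumptions made for illustration. The sketch shows only the idea that each metabolite, reaction, and pathway is a separate object, and that each field is paired with an inverse field that is kept in sync.

```python
from dataclasses import dataclass, field


@dataclass
class Metabolite:
    frame_id: str
    name: str
    appears_in_leftside_of: list = field(default_factory=list)   # reaction IDs
    appears_in_rightside_of: list = field(default_factory=list)  # reaction IDs


@dataclass
class Reaction:
    frame_id: str
    left: list                                       # metabolite IDs
    right: list                                      # metabolite IDs
    in_pathway: list = field(default_factory=list)   # pathway IDs


@dataclass
class Pathway:
    frame_id: str
    name: str
    reaction_list: list = field(default_factory=list)  # reaction IDs


class ToyPGDB:
    """Hypothetical container that stores the three levels and fills inverse fields."""

    def __init__(self):
        self.metabolites = {}
        self.reactions = {}
        self.pathways = {}

    def add_metabolite(self, frame_id, name):
        self.metabolites[frame_id] = Metabolite(frame_id, name)

    def add_reaction(self, frame_id, left, right):
        self.reactions[frame_id] = Reaction(frame_id, list(left), list(right))
        # Maintain the inverses of LEFT and RIGHT on the metabolite objects.
        for met_id in left:
            self.metabolites[met_id].appears_in_leftside_of.append(frame_id)
        for met_id in right:
            self.metabolites[met_id].appears_in_rightside_of.append(frame_id)

    def add_pathway(self, frame_id, name, reaction_ids):
        self.pathways[frame_id] = Pathway(frame_id, name, list(reaction_ids))
        # Maintain IN-PATHWAY as the inverse of REACTION-LIST.
        for rxn_id in reaction_ids:
            self.reactions[rxn_id].in_pathway.append(frame_id)

    def unassigned_reactions(self):
        """Level-2 reactions that belong to no level-3 pathway."""
        return sorted(r.frame_id for r in self.reactions.values() if not r.in_pathway)


def build_example():
    db = ToyPGDB()
    for frame_id, name in [("INDOLE", "indole"), ("SER", "L-serine"),
                           ("TRP", "L-tryptophan"), ("WATER", "H2O")]:
        db.add_metabolite(frame_id, name)
    db.add_reaction("RXN0-2382", left=["INDOLE", "SER"], right=["TRP", "WATER"])
    db.add_reaction("RXN-TOY-1", left=["SER"], right=["WATER"])  # made-up reaction
    db.add_pathway("TRYPSYN-PWY", "tryptophan biosynthesis", ["RXN0-2382"])
    return db
```

Calling `build_example().unassigned_reactions()` returns the one reaction that was never placed in a pathway, which mirrors the point that the reaction layer can be used on its own.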
within Pathway Tools through SRI's online registry of PGDBs (http://BioCyc.org/registry.html). The overall framework of BioCyc is to define a single foundational database of experimentally elucidated pathways from many organisms (MetaCyc; SRI International, Menlo Park, CA) that is used to predict the metabolic pathways of other organisms from their sequenced genomes. Each prediction is modeled as a single organism-specific PGDB. Thus, the BioCyc organism-specific PGDBs each model the metabolism of a single organism in detail, whereas MetaCyc captures well-defined pathways from many organisms but does not define a comprehensive model of the pathways of any organism [except for E. coli, because MetaCyc contains all metabolic pathways from the EcoCyc (SRI International, Menlo Park, CA) PGDB].
BioCyc is divided into three tiers that reflect the degree of manual curation of these databases. The Tier 1 PGDBs EcoCyc and MetaCyc have undergone more than two person-decades of curation each. By curation, we mean effort on the part of biologists to read the biomedical literature and to enter information from publications into these PGDBs.
Tier 1: EcoCyc
EcoCyc (4) describes the genome, the metabolic pathways, and the transcriptional regulatory network of E. coli K-12. EcoCyc curators enter newly discovered functions of E. coli genes into EcoCyc, as reported in the literature. They also enter E. coli metabolic pathways and information about E. coli operon organization, promoter locations, and control of those promoters by binding of transcription factors to nearby DNA sites. EcoCyc contains a written summary of the function of every E. coli gene for which experimental information is available. The information in EcoCyc was obtained from the more than 14,000 publications cited by EcoCyc.
Metabolic Databases
The preceding conceptual structure underlies all PGDBs created by Pathway Tools. Those PGDBs fall into several categories. The BioCyc collection of PGDBs is a collaboration between the Bioinformatics Research Group at SRI International and the Computational Genomics Group at the European Bioinformatics Institute (2). In addition, many other PGDBs have been created by other users of Pathway Tools. Some are listed in a table on the BioCyc home page (http://BioCyc.org). These PGDBs can be accessed through the web sites operated by their creators, and in some cases they are available through the BioCyc web site. In addition, some PGDBs can be downloaded for local use
Tier 1: MetaCyc
MetaCyc (5) is a multiorganism encyclopedia of metabolic pathways and enzymes. Like EcoCyc, it contains literature-derived information on experimentally elucidated metabolic pathways and enzymes. MetaCyc version 10.5 (October 2006) contains all
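[Figure 3 diagram. Recoverable labels: reaction Rxn0-2382 (indole + L-serine → L-tryptophan + H2O); metabolites Ser (L-serine) and Trp (L-tryptophan); enzymatic reaction EnzRxn0-3701; chromosome Ecoli-K12-Chromosome; fields in-pathway, left, right, appears-in-leftside-of, appears-in-rightside-of, enzrxn, reaction, enzyme, and components.]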
Figure 3 Relationships that link the genome and proteome of a PGDB to the metabolic network.
197 metabolic pathways from EcoCyc and all EcoCyc metabolic enzymes. MetaCyc includes another 600 metabolic pathways from other organisms. Approximately half of the pathways in MetaCyc are from microorganisms, and approximately one-third are from plants, with the remainder coming largely from animals. The metabolic pathways in MetaCyc were elucidated experimentally in more than 700 organisms, and the information in MetaCyc has been drawn from more than 10,000 publications. MetaCyc contains extensive mini-review summaries and literature citations in its pathways. It also contains enzyme entries that explain the biological functions of pathways and enzymes and make this information accessible to scientists who are not experts in each pathway and enzyme.
The Tier 2 and Tier 3 PGDBs were derived computationally by applying the following sequence of computational operations to the annotated genomes of each organism, as described in more detail in the next section and in Reference 2.
1. The annotated genome of each organism was converted to PGDB format.
2. The PathoLogic program predicted the metabolic pathway complement of each organism.
3. The PHFiller program predicted which genes within the organism code for missing enzymes within the predicted metabolic pathways.
4. An operon predictor was executed for the bacterial genomes.
5. A cellular overview diagram was computed for each organism.
Tier 2
The Tier 2 PGDBs were created computationally by the preceding methodology, and then some amount of manual curation was applied to these PGDBs. For example, after being created computationally, the HumanCyc (SRI International, Menlo Park, CA) PGDB (6) received extensive curation to assign human metabolic enzymes manually to their associated reactions; to enter 10 metabolic pathways and their enzymes from the literature into HumanCyc; and to enter associated summaries, literature citations, and other information such as enzyme regulators, cofactors, and subunit structure.
Tier 3
The Tier 3 PGDBs were created computationally by the preceding methodology, with no subsequent manual curation. We encourage scientists to adopt Tier 2 and Tier 3 PGDBs for ongoing curation and refinement. No single group can curate all the world's genomes, so we encourage experts in the biology of an organism to assume responsibility for updating its PGDB to reflect existing and emerging information in the literature, on an ongoing basis.
of the enzymes within a pathway, and all components of the drawing are clickable by the user. For example, clicking on a metabolite takes the user to a page that shows the metabolite structure and lists all its synonyms, all reactions and pathways in which it is a substrate, and all enzymes whose activities it regulates. Pathway Tools also generates information pages for enzymes and for biochemical reactions.
Pathway Tools can generate a visualization of the entire metabolic network of an organism, which we call the cellular overview diagram (7). This diagram is generated automatically from any PGDB, and it depicts all metabolic pathways in the PGDB as well as reactions not assigned to any pathway and all transporters identified in the PGDB. The overview diagram can be used to visualize omics datasets in a mode of operation called the Omics Viewer (7). The input to the Omics Viewer is a combination of gene expression data, proteomics data, metabolomics data, or other measurements that associate numbers with genes, reactions, or metabolites. The numbers are mapped to colors that are painted onto the elements of the cellular overview to allow the power of the human visual system to be used to interpret large-scale datasets in a pathway context. For example, a dot in the diagram that represents a single metabolite would be assigned a color that indicates the measured concentration of that metabolite in a metabolomics experiment. Finally, Pathway Tools can generate a poster-size version of the cellular overview complete with labels for entities in the diagram.
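The mapping from measured numbers to colors can be sketched in a few lines of Python. The linear blue-to-red scale, the function names, and the neutral gray used when all values are equal are assumptions for illustration; the source does not specify the color scheme that the Omics Viewer uses.

```python
def value_to_rgb(value, vmin, vmax):
    """Linearly map a value in [vmin, vmax] to an RGB triple from blue (low) to red (high)."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values that fall outside the range
    return (round(255 * t), 0, round(255 * (1 - t)))


def color_overview(measurements):
    """Assign a color to each measured entity (gene, reaction, or metabolite ID)."""
    lo = min(measurements.values())
    hi = max(measurements.values())
    if lo == hi:
        # No spread in the data: paint everything a neutral gray.
        return {entity: (128, 128, 128) for entity in measurements}
    return {entity: value_to_rgb(v, lo, hi) for entity, v in measurements.items()}
```

For a metabolomics dataset, the keys would be metabolite IDs and the values measured concentrations; each resulting color would then be painted onto the corresponding dot of the overview.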
A final inference tool provided by Pathway Tools is called the pathway hole filler. A pathway hole is a reaction in a metabolic pathway for which no enzyme has been identified in the genome that catalyzes that reaction. Typical microbial genomes contain 200–300 pathway holes. Although some pathway holes are probably genuine, we believe that most probably result from the failure of the genome annotation process to identify the genes that correspond to those pathway holes. For example, genome annotation systems systematically under-annotate genes with multiple functions, and we believe that the enzyme functions for many pathway holes are unidentified second functions for genes that already have one assigned function.
The method used by the pathway hole filling program PHFiller (8) is as follows. Given a reaction that is a pathway hole, the program first queries the UniProt database to find all known sequences for enzymes that catalyze that same reaction in other organisms. The program then uses the BLAST tool to compare that set of sequences against the full proteome of the organism in which we are seeking hole fillers. It scores the resulting BLAST hits by considering information such as genome localization, for example, whether a potential hole filler is in the same operon as another gene in the same metabolic pathway. At a stringent score cutoff, our method finds potential hole fillers for approximately 45% of the pathway holes in a microbial genome.
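As a rough illustration of the two ideas above, the following Python sketch first lists pathway holes and then ranks candidate genes for one hole. It is not the PHFiller algorithm, which Reference 8 describes as a Bayesian method; the multiplicative operon bonus, the function names, and the input shapes are assumptions chosen only to show how sequence similarity and genome context might be combined.

```python
def find_pathway_holes(pathways, enzymes_for_reaction):
    """Return (pathway_id, reaction_id) pairs for reactions with no assigned enzyme.

    pathways: dict mapping pathway ID -> list of reaction IDs
    enzymes_for_reaction: dict mapping reaction ID -> list of gene/enzyme IDs
    """
    holes = []
    for pwy_id, reaction_ids in pathways.items():
        for rxn_id in reaction_ids:
            if not enzymes_for_reaction.get(rxn_id):
                holes.append((pwy_id, rxn_id))
    return holes


def rank_candidates(hits, operon_of, pathway_genes, operon_bonus=1.5):
    """Rank BLAST hits for one hole with a toy score (not the published model).

    hits: list of (gene_id, bit_score) pairs from a similarity search
    operon_of: dict mapping gene ID -> operon ID
    pathway_genes: genes already assigned to other reactions of the same pathway
    operon_bonus: hypothetical multiplier for hits that share an operon with a pathway gene
    """
    pathway_operons = {operon_of[g] for g in pathway_genes if g in operon_of}
    scored = []
    for gene_id, bit_score in hits:
        shares_operon = operon_of.get(gene_id) in pathway_operons
        score = bit_score * (operon_bonus if shares_operon else 1.0)
        scored.append((gene_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this toy scoring, a weaker sequence hit can outrank a stronger one when it sits in the same operon as a known pathway gene, which is the kind of contextual evidence the text describes.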
Figure 4 Pathway comparison of B. anthracis Ames and E. coli K-12. Rows 1 and 2 of the table indicate that these organisms contain 142 and 114 biosynthetic pathways, respectively, of which 8 and 6 pathways are for biosynthesis of amines and polyamines, respectively.
Figure 5 Detailed comparison of pathways of biosynthesis of lipids and fatty acids of B. anthracis and E. coli. This report indicates the presence of specific named pathways in each organism with an "X." Clicking on the name of the pathway will display the pathway itself.
pathways of small-molecule metabolism, which contain 702 component reactions. Another 245 reactions of small-molecule metabolism are not assigned to a specific pathway. One hundred thirty-five reactions in E. coli are catalyzed by more than one enzyme. Conversely, 177 E. coli enzymes are multifunctional, meaning they catalyze more than one reaction. The E. coli metabolic network contains 975 metabolites. Each reaction contains an average of 4.1 metabolites, and each metabolite is a substrate in 5.2 reactions, on average. Most pathways are 1–7 reactions in length, but the longest pathway contains 20 reactions. Interestingly, substrate-level inhibition of enzymes is more than four times more common than substrate-level enzyme activation: 92 enzymes have recorded inhibitors, whereas 21 enzymes have recorded activators, for a total of 97 enzymes in the metabolic network that have some type of known substrate-level regulation.
A computer formulation of metabolism also facilitates comparisons of the metabolic networks of two or more organisms. The cellular overview diagram can be used for comparative purposes by coloring those metabolic reactions shared between two organisms by using the desktop version of Pathway Tools. The Web version of Pathway Tools provides a suite of comparative analysis tools. For example, Figure 4 shows comparisons of the overall pathway complements of E. coli and B. anthracis, broken down according to the Pathway Tools ontology of pathways. Figure 5 shows a detailed comparison of the pathways of biosynthesis of fatty acids and lipids in these two organisms.
A second form of pathway analysis is computing the potential outputs that the metabolic network might produce when supplied with a set of input metabolites (10).
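One simple way to approximate this kind of analysis is forward propagation: start from the input metabolites and repeatedly "fire" every reaction whose reactants are all available, adding its products to the available set until nothing changes. The Python sketch below assumes this scheme and treats every reaction as irreversible in the written direction; it is an illustration and not necessarily the procedure of Reference 10.

```python
def reachable_metabolites(reactions, nutrients):
    """Compute the metabolites producible from a set of input nutrients.

    reactions: list of (reactants, products) pairs, each a list of metabolite IDs
    nutrients: iterable of metabolite IDs supplied to the network
    """
    available = set(nutrients)
    changed = True
    while changed:
        changed = False
        for reactants, products in reactions:
            # A reaction can fire only when every reactant is already available.
            if set(reactants) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available
```

With the toy network `[(["A", "B"], ["C"]), (["C"], ["D"]), (["E"], ["F"])]` and nutrients `A` and `B`, the procedure reaches `C` and `D` but never `F`, because `E` is not supplied.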
A third computational analysis method predicts choke points in the metabolic network: enzymes that, if inhibited, would likely create a major bottleneck in the metabolic network and are therefore likely to be good targets for developing antimicrobial drugs (11). In addition, it is possible to compute the equilibrium flux rates through an entire metabolic network (12).
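The text gives only an informal description of a choke point. A common working definition, assumed here, is a reaction that is the only consumer or the only producer of some metabolite in the network; the enzymes catalyzing such reactions are then the candidate targets. The Python sketch below applies that assumed definition to a toy network and omits refinements that a real analysis would need, such as handling dead-end metabolites and reversible reactions.

```python
from collections import defaultdict


def choke_point_reactions(reactions):
    """Return IDs of reactions that uniquely consume or uniquely produce a metabolite.

    reactions: dict mapping reaction ID -> (reactants, products),
               where each side is a list of metabolite IDs
    """
    consumers = defaultdict(set)  # metabolite ID -> reactions that consume it
    producers = defaultdict(set)  # metabolite ID -> reactions that produce it
    for rxn_id, (reactants, products) in reactions.items():
        for met_id in reactants:
            consumers[met_id].add(rxn_id)
        for met_id in products:
            producers[met_id].add(rxn_id)

    choke = set()
    for rxn_ids in list(consumers.values()) + list(producers.values()):
        if len(rxn_ids) == 1:
            choke |= rxn_ids
    return sorted(choke)
```

In a network where two reactions both convert `A` to `B` and a single reaction converts `B` to `C`, only the last reaction is flagged, since blocking either of the first two leaves an alternative route.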
Acknowledgment
I thank Carol Fulcher and Alexander Shearer for comments on this manuscript. This work was supported by grants GM077678, GM077905, and GM70065 from the National Institutes of Health.
References
1. Karp PD, Paley S, Romero P. The Pathway Tools software. Bioinformatics. 2002;18:225–232.
2. Karp PD, et al. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res. 2005;33:6083–6089.
3. Green ML, Karp PD. The outcomes of pathway database computations depend on pathway ontology. Nucleic Acids Res. 2006;34:3687–3697.
4. Keseler IM, et al. EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res. 2005;33:334–337.
5. Caspi R, et al. MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res. 2006;34:511–516.
6. Romero P, et al. Computational prediction of human metabolic pathways from the complete human genome. Genome Biol. 2005;6:2.
7. Paley SM, Karp PD. The Pathway Tools cellular overview diagram and Omics Viewer. Nucleic Acids Res. 2006;34:3771–3778.
8. Green ML, Karp PD. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics. 2004;5:76.
9. Ouzounis CA, Karp PD. Global properties of the metabolic map of Escherichia coli. Genome Res. 2000;10:568–576.
10. Romero PR, Karp P. Nutrient-related analysis of pathway/genome databases. Pac Symp Biocomput. 2001;471–482.
11. Yeh I, et al. Computational analysis of Plasmodium falciparum metabolism: organizing genomic information to facilitate drug discovery. Genome Res. 2004;14:917–924.
12. Varma A, Palsson BO. Metabolic flux balancing: basic concepts, scientific and practical use. Bio/Technology. 1994;12:994–998.
13. Krummenacker M, et al. Querying and computing with BioCyc databases. Bioinformatics. 2005;21:3454–3455.
14. Lee TJ, et al. BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics. 2006;7:170.