
An introduction to the Pfam protein families database

doi:10.1038/pid.2011.3

Standfirst

Pfam is a database of conserved protein domain families that is widely used by biologists to annotate and classify proteins. The database comprises two classes of entries: (i) Pfam-A families, which consist of a seed alignment, a hidden Markov model, full alignments, associated annotation, literature references, and database links; and (ii) Pfam-B families, which are automatically generated alignments of sequence clusters, derived from the Automatic Domain Decomposition Algorithm (ADDA) database, with no annotation, that supplement the Pfam-A families. Users are able to search their sequences against libraries of both sets of entries to gain functional insight into their proteins of interest. Pfam is available via servers in the United Kingdom (http://pfam.sanger.ac.uk/), the United States (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

Introduction

The number of nucleic acid and protein sequences being deposited in the sequence repositories has been increasing exponentially with time. Due to the limited resources of people, time and money, only a tiny proportion of these sequences has been experimentally characterized. Resources that can facilitate the transfer of information from characterized proteins to uncharacterized homologous proteins are therefore invaluable.

Pfam, a comprehensive database of over 13,000 conserved protein families, is one such resource. It allows accurate transfer of annotation from functionally and sometimes also structurally characterized proteins to proteins of unknown function. It is used extensively by biologists for classifying and annotating proteins and for analyzing proteomes. Pfam has also proved a useful tool for the structural biology community, aiding in identifying interesting new targets for structure determination.

Pfam is composed of two types of families, Pfam-A and Pfam-B. Each manually curated Pfam-A family has a seed alignment, which contains a set of representative sequences for that family, from which we generate a profile hidden Markov model (HMM) using the HMMER3 package (http://hmmer.janelia.org/). Each Pfam-A profile HMM is searched against a primary sequence database that is based on UniProtKB [1]. Additional searches are made against the NCBI GenPept database [2] and a set of metagenomic sequences [3]. All sequences scoring above curated thresholds, which are set conservatively to avoid the inclusion of false positives, are included in the full alignments for that family. Pfam-A families are also annotated with structural and functional information if available and are crosslinked to other relevant databases.

At every new release of the Pfam database, the underlying sequence databases are updated with the most recent versions. To improve coverage, we automatically generate a second group of families, Pfam-B families, using protein-sequence clusters produced by the ADDA database [4]. Pfam-B families are derived from sequence segments in ADDA that do not match any Pfam-A regions. They are thus supplemental to the Pfam-As and give an indication of additional conserved regions. Pfam-B families have an associated alignment but do not have any annotation or literature references. An HMM is generated for each Pfam-B to enable searches to be run against them. Unlike Pfam-A alignments, Pfam-B alignments have not been manually checked for quality by a Pfam curator. A full explanation of the arrangement of data relating to a Pfam-A family is given in the Supplementary Information, but we will give little more on Pfam-B families except to say they may be indicative of potential new domains. In Pfam release 25.0, 76.7% of sequences in UniProtKB have at least one match to a Pfam-A family, and Pfam-B families provide an additional coverage of 7.9%.

Finally, we group together Pfam-A families into higher level collections, known as clans. A clan contains those families that we believe to have a common evolutionary ancestor. Several lines of evidence are used to determine whether or not two families are related. Where structural data are available, we use these to see if the families adopt the same overall structure. In the absence of structural data, we use the profile-comparison tools SCOOP [20] and HHsearch [15] and also look to see if the HMMs match any of the same sequence regions. Each Pfam-A family can only be classified into one clan, and, currently (Pfam release 25.0), approximately 28% of all Pfam-A families belong to a Pfam clan. There are at present 458 clans in Pfam.

Here we will describe key entry points into the database, how searches may be carried out and the results interpreted, and some interesting features of the website. This is followed by a couple of examples of usage illustrating how to carry out further investigations of a protein sequence to answer specific biological questions.

Detailed information on how the data in Pfam are generated, stored and presented to the user either on the web page or as a download for local use, is found in Supplementary Information, along with some additional search tools.

How to Enter the Pfam Website

The Pfam website has a number of entry points: through keyword searches; via sequence, family, clan, or structure identifiers or accessions; and by searching protein sequences of interest against Pfam HMMs.

An overview of the main entry points and results pages on the Pfam website is given here. For further details on how the information in the sequence, family, clan and structure pages is laid out and details on how to use the website more fully, please refer to the Supplementary Information section.

Figure 1 shows a typical family page, with labels highlighting some of the navigation links and key information that is used on many of the pages within the website. The family page for a Pfam-A family contains the functional annotation at the top of the page, derived either directly from the Wikipedia (http://www.en.wikipedia.org/wiki/Piwi) entry for that family if one is available or from Pfam or InterPro. At the side of the page are a number of tabs, each relating to a different set of data. This tab format is used throughout the website.

Figure 1

A typical Pfam family page. At the top of every Pfam page are six links (Home, Search, Browse, FTP, Help and About), each of which is described in the main text. The Jump to box and the Keyword search are available on most Pfam pages. The context-specific icons at the top of the page indicate the number of different architectures, the number of sequence segments in the full alignment, the number of documented interactions, the number of different species represented in the full alignment and the number of PDB chain IDs that can be displayed for the family concerned. Each tab gives access to different data, details of which can be found in the main text. As illustrated here for the Piwi family (Pfam accession PF02171), this Summary tab displays the Wikipedia article. The Clans tab is not grayed out, indicating that Piwi is part of a clan.

Pfam keyword search: The Keyword search at the top right-hand corner of every Pfam page allows entry by free text such as with the word 'apoptosis'. The results of the search will display a list of Pfam-A families that match the search term(s) entered, and clicking on any one opens the relevant family page.

Pfam sequence accession or identifier: Enter a sequence accession or identifier either in the View a sequence box on the Pfam home page or in the Jump to boxes found on most web pages to see the precalculated Pfam domain composition for the sequence of interest. The types of accessions and identifiers that are allowed are given below:

UniProtKB accession or identifier (for example, P15498 or VAV_HUMAN)

NCBI GI number or secondary accession (for example, 349163 or AAA72015.1)

Metagenomics identifier or accession (for example, ECP38001.1 or 141063662)

After entering the sequence identifier (ID) or accession, you will be taken to a page showing the Pfam domain composition for that protein. Figure 2 shows the domain composition for a real example sequence, and Figure 3 shows all possible graphics that might be used to represent different features of a sequence as employed throughout the Pfam website.

Figure 2

A typical Sequence page. This is the sequence page reached by entering Pfam with the accession or identifier for AURE_STAAU (UniProtKB: P81177), the second sequence displayed by default for the domain organization for the sequences in the full alignment of the PepSY domain (Pfam accession PF03413) of Figure 10. The page shows all the summary information from UniProtKB with description, source organism and length, and then the Pfam domain graphic indicating visually the relative positions of the different Pfam domains and regions on the sequence, and in tabular form the start and end residues. Explanations of the individual domain graphics are given in the legend to Figure 3.

Figure 10

Domain organization for the PepSY family (Pfam accession PF03413). The domain organization listing of all the unique domain organizations or architectures in which the PepSY domain is found is shown. The first graphic is that of the most abundant combination and is drawn on the longest sequence carrying that domain. Each row contains the following information: the number of sequences that exhibit this architecture; a textual description of the architecture (e.g., PepSY, Peptidase_M4); a link to the sequence page for the sequence in this graphic; the UniProtKB description of the protein sequence; the number of residues in the sequence; and the Pfam graphic itself. The show/hide button toggles between viewing the single example and viewing all (629 in the first case) examples of a particular combination.

Figure 3

An illustrative domain graphic of all sequence features. This figure shows all the possible elements that might be present on a sequence; Pfam-A families and domains are represented by lozenge shapes, whereas Pfam-A repeats and motifs are shown as rectangles. See the Data Model section of Supplementary Information for the definition of each type. The match within the alignment coordinates is blocked in solid color, and that between the envelope coordinates appears in a lighter shade of the same color; where there is a partial match to an HMM, the edge of the shape is jagged. Pfam-B families are shown as a rectangle with three stripes. Disulfide bridges are marked with a gray line and nested domains with a black line. Active-site residues are depicted with a diamond-head lollipop, and metal-binding residues are depicted with a circle or square-headed lollipop. A circular head denotes that the metal-binding residue has been determined experimentally, whereas a square head means the metal-binding residue is predicted. The color of the lollipop head on metal-binding residues denotes which ion the residue binds, and the identity of the ion can be found by mousing over the lollipop of interest. Positions for signal peptide, coiled coil, transmembrane and low-complexity regions are shown as semitransparent rectangles in the colors orange, green, red and blue, respectively. Tool tips are available for all elements of the graphic.

Pfam family accession or identifier search: Another means of accessing a Pfam family page is by entering a Pfam-A or Pfam-B family ID or accession in the View a Pfam family box on the Pfam home page or in the universal Jump to boxes. Examples of Pfam family identifiers and accessions are PF00001, 7tm_1; PB000001 and Pfam-B_1.

Pfam clan accession or identifier search: A clan ID or accession can be entered in the universal Jump to boxes or in the View a clan box on the home page (for example, CL0088 or Alk_phosphatase) to access a Pfam clan page. Figure 4 is a typical example of a Pfam clan page listing all member families in the clan.

Figure 4

A typical Clan page. The Summary tab on the Clan page for the Pfam clan Alk_phosphatase (Pfam clan accession CL0088) shows the summary function, the literature references, the clan members, external database links and one representative PDB structure from a drop-down list of all structures in the clan. The various general and clan-specific tabs are shown on the left-hand side.

Search Pfam with a PDB identifier: Enter the PDB identifier in the universal Jump to boxes or the View a structure box on the Pfam home page to access a structure page. Figure 5 shows a structure page with, in this case, summary information for that structure taken from the wiki of The Open Protein Structure Annotation Network (TOPSAN; http://www.topsan.org/).
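The identifier-based entry points described above can be told apart by their shape. The following is a minimal sketch, assuming that the example identifiers quoted in this article (PF00001, PB000001, Pfam-B_1, CL0088, 3due, VAV_HUMAN, P15498) are representative; the patterns and the function name guess_entry_type are illustrative assumptions, not the routing rules actually used by the Pfam website.

```python
import re

# Heuristic patterns inferred from the examples quoted in the text.
# They are assumptions for illustration, not the website's own rules.
PATTERNS = [
    ("Pfam-A accession", re.compile(r"PF\d{5}")),
    ("Pfam-B accession", re.compile(r"PB\d{6}")),
    ("Pfam-B identifier", re.compile(r"Pfam-B_\d+")),
    ("clan accession", re.compile(r"CL\d{4}")),
    ("PDB identifier", re.compile(r"[0-9][A-Za-z0-9]{3}")),
    ("UniProtKB identifier", re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")),
    # Pattern for six- and ten-character UniProtKB accessions.
    ("UniProtKB accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}")),
]


def guess_entry_type(term: str) -> str:
    """Return a best-guess label for a term typed into a 'Jump to' box."""
    term = term.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(term):
            return label
    # Free-text family or clan names (e.g. 7tm_1) fall through to here.
    return "unrecognised (try the keyword search)"


if __name__ == "__main__":
    for example in ["PF00001", "PB000001", "CL0088", "3due", "VAV_HUMAN", "P15498"]:
        print(example, "->", guess_entry_type(example))
```

Family and clan identifiers such as 7tm_1 or Alk_phosphatase are free-form names, so a sketch like this cannot recognise them by pattern alone; a lookup against the downloaded Pfam data would be needed for that.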

Figure 5

A Structure page with TOPSAN entry. The Summary tab of the Structure page for PDB:3due, the structure common to the PepSY clan, gives summary details of the experimental procedures, then a list of links to external structure databases, and finally the information about the structure from TOPSAN. Structure-specific tabs on the left link to the PubMed Central literature reference if available (see Figure 18), to the domain organization (i.e., a graphic of all domains on the sequence), the sequence mappings of these domains, and the different viewing possibilities for the structure.

Figure 18

A Structure page with a BioLit entry. The Literature tab of the Structure page for PDB:1jtg shows that the BioLit project has tagged one reference from PubMed Central. The full title, author list, journal reference, abstract and external links of this article are shown along with the figures found in the article.

Searching with a protein sequence: The major feature of the Pfam website is the facility for searching a protein sequence against the libraries of Pfam-A and -B profile HMMs to find a family with which the sequence is homologous and hence to inform on its function. The next section gives more details on how to perform a search with a protein sequence.

Sequence Searches

A sequence search is useful in cases where a protein is not present in Pfam, or when the user wishes to identify if a domain might be present on a protein already in the database but with a score below the predetermined threshold for that family. In both cases the user can paste the sequence into the Search box as indicated below and set the thresholds accordingly.

For each family, a Pfam curator has determined manually the optimal score (called a gathering cutoff threshold or GA) that a sequence must attain for its match to be deemed significant. Sequences that score above the GA are included in the alignment for the corresponding sequence database, thus generating a UniProtKB full alignment (this is the Pfam-A full alignment), an NCBI GenPept alignment and a metagenomics alignment for each Pfam-A family. These thresholds are set conservatively to ensure that no known false positives are included. The Data Model section in the Supplementary Information gives a full explanation of Pfam scores and thresholds, and explains how Pfam-A families are built.

How to perform sequence searches on the website

There are two main ways to search for a match to a protein sequence on the Pfam website: (i) by an interactive single-sequence search or (ii) by a batch search wherein the user uploads a file containing up to 5,000 sequences, following which the results are emailed back to the user. Both types of search can be accessed via the Search link at the top of the Pfam home page by clicking on the Sequence tab on the left to perform a single-sequence search or on the Batch search tab to perform a batch search. For both types of searches, there are options to set an alternative threshold for detecting weak matches, and to include searches against the Pfam-B as well as against the Pfam-A libraries of HMMs. The results of a single-sequence search are shown in Figure 6, with the important features highlighted. Active-site residues are predicted for both types of searches using an in-house homology-derived prediction algorithm [5].
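Because a batch upload is limited to 5,000 sequences, a larger FASTA file has to be divided before submission. The sketch below shows one way this might be done; the function names and the MAX_BATCH constant are illustrative and are not part of any Pfam tool.

```python
from typing import Iterable, Iterator, List

MAX_BATCH = 5000  # upper limit for one batch upload, as quoted in the text


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record: List[str] = []
    for line in lines:
        if line.startswith(">") and record:
            # A new header closes the previous record.
            yield "".join(record)
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def batches(records: Iterable[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Group records into lists holding at most `size` records each."""
    batch: List[str] = []
    for rec in records:
        batch.append(rec)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    demo = [">a\n", "MKV\n", ">b\n", "MLL\n", ">c\n", "MAA\n"]
    for i, chunk in enumerate(batches(read_fasta_records(demo), size=2), start=1):
        print(f"batch {i}: {len(chunk)} sequence(s)")
```

Each resulting batch can then be written to its own file and uploaded separately.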

Figure 6

Results page from searching a protein sequence and requesting both Pfam-A and Pfam-B matches. A graphic of the domain architecture is displayed in the top panel of the Results page, significant hits to Pfam-A appear in the next panel down, insignificant (below threshold) matches in the following panel, and hits to Pfam-B in the last panel. The option to display the residue-by-residue scores is also available via the show/hide button. When the alignment is 'shown', the #HMM line shows the consensus of the model, with capital letters representing the most conserved (high information content) positions, and dots (.) indicating insertions in the query sequence with respect to the model. Identical residues are colored cyan, and similar residues are colored dark blue; the #MATCH line indicates matches between the model and the query sequence, where a + indicates positive score, interpretable as conservative substitution with respect to what the model expects at that position; the #PP line represents the posterior probability (essentially the expected accuracy) of each aligned residue, where a 0 means 0-5%, 1 means 5-15%, and so on to 9 meaning 85-95% and a * meaning 95-100% posterior probability (pp); the #SEQ line is the query sequence, colored according to the pp for each residue match on a scale from bright green for * through paler green and pale red down to bright red for 0.
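The encoding of the #PP line described in this legend can be decoded mechanically. The following sketch maps each symbol to its posterior probability range in percent; treating a dot as "no value" for gap columns is an assumption made here, and the function names are illustrative.

```python
from typing import List, Optional, Tuple


def pp_range(symbol: str) -> Optional[Tuple[int, int]]:
    """Map one #PP symbol to its posterior probability range, in percent."""
    if symbol == "*":
        return (95, 100)
    if symbol == "0":
        return (0, 5)
    if len(symbol) == 1 and symbol in "123456789":
        digit = int(symbol)
        # 1 -> 5-15%, 2 -> 15-25%, ..., 9 -> 85-95%
        return (10 * digit - 5, 10 * digit + 5)
    if symbol == ".":
        # Assumption: a dot marks a column with no aligned residue.
        return None
    raise ValueError(f"unexpected #PP symbol: {symbol!r}")


def decode_pp_line(pp_line: str) -> List[Optional[Tuple[int, int]]]:
    """Decode a whole #PP line, one entry per alignment column."""
    return [pp_range(symbol) for symbol in pp_line]


if __name__ == "__main__":
    print(decode_pp_line("589*"))
```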

In releases before 24.0, matches to Pfam-B families were detected by performing a BLAST search of the query sequence against a FASTA file of all the Pfam-B regions. We now search the query sequence against a library of HMMs built from Pfam-Bs 1 to 20,000 with a default E-value cutoff of 0.001, which has improved both the sensitivity and the specificity of the Pfam-B searches as compared with the older method.

How to perform sequence searches locally

Users also have the option to perform Pfam searches locally by downloading the Perl script pfam_scan.pl, Pfam Perl modules and the Pfam data files, all of which are available from the Pfam ftp site. To run the script, users will also need to install HMMER3 (http://hmmer.janelia.org/) and a few modules from CPAN. Full details of how to get pfam_scan.pl up and running can be found on our ftp site (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/README).
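As a rough illustration of a local run, the sketch below wraps pfam_scan.pl in a small Python helper. It assumes that pfam_scan.pl is on the PATH, that the Pfam data files have been downloaded and prepared as the README describes, and that the script accepts the -fasta, -dir and -outfile options; the option names should be checked against the README for the release in use. The helper function names are illustrative.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pfam_scan_command(fasta: str, pfam_dir: str,
                            outfile: Optional[str] = None) -> List[str]:
    """Assemble the command line for a local pfam_scan.pl search."""
    # -fasta: query sequences; -dir: directory holding the Pfam data files.
    cmd = ["pfam_scan.pl", "-fasta", fasta, "-dir", pfam_dir]
    if outfile is not None:
        cmd += ["-outfile", outfile]
    return cmd


def run_pfam_scan(fasta: str, pfam_dir: str,
                  outfile: Optional[str] = None) -> str:
    """Run pfam_scan.pl and return whatever it prints to standard output."""
    if shutil.which("pfam_scan.pl") is None:
        raise FileNotFoundError("pfam_scan.pl was not found on the PATH")
    result = subprocess.run(build_pfam_scan_command(fasta, pfam_dir, outfile),
                            check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(" ".join(build_pfam_scan_command("query.fasta", "pfam_data")))
```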

How to understand search results

Scores: Using the default search parameters on the website will result in matches being reported that have an E-value less than or equal to 1.0. This is a lenient threshold, and all resulting Pfam-A matches that satisfy this E-value threshold are then labeled as significant or insignificant, depending on whether they fall above or below the GA for that Pfam family (see Figure 6). We are confident that a significant match to a family is real, whereas an insignificant match, which falls below our GA threshold but above the user-defined threshold, might indicate either low similarity (distant match) or that the match has occurred by chance (a false positive).
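The labeling rule in this paragraph can be written as a short decision function. This is a simplified sketch: it assumes a single gathering threshold per family compared against a single match score, and treats a score equal to the GA as significant. The class and field names are hypothetical and do not mirror the website's data structures.

```python
from dataclasses import dataclass

DEFAULT_EVALUE_CUTOFF = 1.0  # lenient default reporting threshold on the website


@dataclass
class Match:
    family: str
    score: float                # score of the match against the family HMM
    evalue: float               # E-value of the match
    gathering_threshold: float  # curated GA for the family


def classify(match: Match, evalue_cutoff: float = DEFAULT_EVALUE_CUTOFF) -> str:
    """Label a match as the results page would: significant or insignificant."""
    if match.evalue > evalue_cutoff:
        return "not reported"   # fails the user-defined E-value threshold
    if match.score >= match.gathering_threshold:
        return "significant"    # at or above the curated GA (assumed inclusive)
    return "insignificant"      # reported, but below the GA


if __name__ == "__main__":
    hit = Match(family="Piwi", score=18.2, evalue=0.3, gathering_threshold=25.0)
    print(hit.family, "->", classify(hit))
```

An insignificant label therefore means only that the match passed the E-value filter but not the curated threshold; it may still be a genuine distant homolog.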

The user can also either specify their own Pfam-A E-value threshold or opt for the Pfam-A curated GA thresholds when performing their searches. For single-sequence searches on the website, a graphical view of the search results is shown in the top block of the results page. The user can choose to display the match scores by clicking the Show button in the Show or Hide alignment column, and the meanings of the color schemes and the resulting lines displayed are given in the paragraph at the top of the page revealed by clicking on Show as indicated above (Figure 6, bottom lines).

The default Pfam-A search parameters for local searches using the pfam_scan.pl script are the Pfam-A curated GA thresholds. A default E-value cutoff threshold of 0.001 is used for the Pfam-B searches. Because Pfam-B families are generated automatically, matches found to Pfam-Bs should be treated with a degree of caution when deciding whether or not they really correspond to a biologically meaningful conserved region of sequences.

Alignment and envelope coordinates: Each sequence match to a Pfam HMM will have two sets of coordinates: the alignment coordinates and the envelope coordinates. The envelope coordinates indicate the region on the sequence over which the match lies, whereas the alignment coordinates indicate the region over which the alignment confidence is high. Graphically, the alignment coordinates are depicted with a solid color and the envelope coordinates in a lighter shade of the same color. When the region within the envelope coordinates does not match the entire length of an HMM, the match is said to be partial; graphically, this is drawn with a jagged edge at the N terminus, the C terminus or both, depending on which region of the match is incomplete (see Figure 2).
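The relationship between the two coordinate sets and the jagged-edge convention can be illustrated as follows. This is a sketch under stated assumptions: the field names are hypothetical, and a match is taken to be partial whenever the matched region of the model does not start at the first HMM position or does not end at the last one. The exact rule used to draw the website graphics may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HmmMatch:
    env_start: int   # envelope coordinates on the sequence
    env_end: int
    ali_start: int   # alignment coordinates on the sequence
    ali_end: int
    hmm_start: int   # first model position covered by the match
    hmm_end: int     # last model position covered by the match
    hmm_length: int  # total number of positions in the model


def alignment_within_envelope(m: HmmMatch) -> bool:
    """The high-confidence alignment region should sit inside the envelope."""
    return m.env_start <= m.ali_start <= m.ali_end <= m.env_end


def jagged_edges(m: HmmMatch) -> Tuple[bool, bool]:
    """Return (N-terminal jagged, C-terminal jagged) for a domain graphic."""
    n_term_partial = m.hmm_start > 1
    c_term_partial = m.hmm_end < m.hmm_length
    return (n_term_partial, c_term_partial)


if __name__ == "__main__":
    m = HmmMatch(env_start=10, env_end=95, ali_start=14, ali_end=90,
                 hmm_start=1, hmm_end=60, hmm_length=80)
    print("alignment inside envelope:", alignment_within_envelope(m))
    print("jagged (N, C):", jagged_edges(m))
```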

Two Examples Using Pfam to Answer Biological Questions

Determining function from structure

Suppose a user has submitted a piece of metagenomic sequence to the sequence search and has found a significant match to a family of bacterial sequences that is designated a domain of unknown function, such as DUF2874 (Pfam accession PF11396). This finding in itself might be thought unfruitful insofar as determination of possible function goes; however, much can be gleaned by following up the additional information that is available for this family on the family page.

First, DUF2874 has been assigned to the PepSY clan (Pfam clan accession CL0320); second, there are several known structures for members of the DUF2874 family; and third, the domain composition or organization indicates that sequences can contain up to five copies of this domain.

Three of the other four members of the PepSY clan are families of bacterial periplasmic proteins whose structurally determined members are known to adopt a BLIP (β-lactamase inhibitor protein) fold and whose domains come in multiple copies. Although the precise function of most of the clan member families is not known, BLIP is known to function as a broad-spectrum inhibitor, especially of β-lactamase, and to act as a regulator of cell morphology and sporulation. Examination of the domain compositions of all clan member families shows them to have a signal peptide in common, and, although DUF2874 is not commonly associated with other domains, PepSY (Pfam accession PF03413), one of the members of the PepSY clan, is frequently found associated with metallopeptidases. All these factors together help to build up a picture of a metagenomic protein sequence that may be acting as a bacterial enzyme inhibitor or lipoprotein inhibitor, possibly of a peptidase.

Figure 7 shows the structural similarities between DUF2874 (Pfam accession PF11396), BLIP (Pfam accession PF07467), SmpA_OmlA (Pfam accession PF04355) and the PepSY (Pfam accession PF03413) families, as well as indicating the sequence similarities between families DUF2874 and PepSY, between BLIP and SmpA_OmlA, and between SmpA_OmlA and DUF3192 (Pfam accession PF11399).

Figure 7

Example interrelationship of clan members shown by structure. Representative structures from each of the four clan members of the clan PepSY (Pfam clan accession CL0320) have been drawn in such a way that the common unit of helix plus β-sheet (boxed) can be identified as being present in all. Arrows indicate similarities of sequence as well as of structure between individual clan members. (This figure has been reproduced from Das et al. [18].)

From disease via phylogenetic tree to model organism

Suppose the user is interested in the disease Best macular dystrophy. This three-word term can be entered into the Keyword search box at the top of the Pfam home page, and the user will be taken directly to the page for the family Bestrophin (Pfam accession PF01062). The summary, derived from the listed literature references, gives some starting details, explaining that the disease in humans is characterized by a depressed light peak in the electro-oculogram (a measurable parameter), indicative of visual impairment.

Choosing the phylogenetic tree from the Trees tab on the Bestrophin family page shows that among the sequences in the seed alignment there are several human sequences that are closely related evolutionarily to dog, mouse, fly and worm model organism sequences.

Clicking on the entry for the human protein BEST1_HUMAN (UniProtKB: O76090) in the tree takes the user to the sequence page in Pfam from where there is a link to the full description of the protein in UniProtKB. There is a link from the UniProtKB page to the OMIM entry (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim). OMIM indicates the disease to be due to the presence of macular lesions in the subretinal space that leads to progressive loss of vision. OMIM also points to the dog as being a suitable model organism for studying the disease. Possible disease mutation sites in the human protein can be found both from the UniProt page and from the second literature reference in the Pfam family summary.

To examine the alignments of just the worm, dog and human members, the user can click the Tree option on the Species tab page and use the tool control panel to select just these members and then to view them either as graphics (Figure 8) or as an alignment (Figure 9). The graphics view shows that domain organization is highly conserved. In the alignment view, the residues known to be mutated in the disease can be seen to be conserved; the figure highlights some of the alignment columns where there are known mutations, and arrows at the side mark the members in which these residues differ from the more common type. It would appear that some of the Caenorhabditis elegans sequences, which are thought to function as chloride ion channels as noted by the first literature reference on the family page, differ at some of the key positions, so using this model organism in further studies should be avoided. The Canis familiaris (dog) sequence appears to resemble the human BEST1_HUMAN (UniProtKB: O76090) protein most closely and so should offer the best protein for further studies elucidating the etiology of the disease.

Figure 8

Example graphical display of architectures of sequences from selected species. This is the graphical display of the human (Homo sapiens), dog (Canis familiaris) and worm (Caenorhabditis elegans) members of the Bestrophin family (Pfam accession PF01062), selected by clicking the relevant nodes on the Species tree shown in Figure 14 and choosing the graphical display option. Similarities and differences can be seen in overall length and position of the Bestrophin domain on the sequences.

Figure 14

Interactive taxonomic tree of the species distribution of all the sequences in a family. The Tree button leads to the interactive taxonomic species tree showing the frequency of occurrence of the Bestrophin domain (Pfam accession PF01062) across different species. The tree is generated by counting the number of domain matches on all sequences in the full alignment at each taxonomic level (number shown in the purple box), along with the number of unique sequences on which each domain is found (number shown in green), then grouping sequences from the same organism according to the NCBI code assigned by UniProtKB and counting the number of distinct sequences on which the domain is found (number shown in pink). The NCBI species tree forms the framework for displaying organisms within the tree. For all but the largest families, the tree is interactive. Due to performance issues with Internet Explorer, users of that browser are shown the text form of the tree by default, with the option to load the interactive tree if required. The tree controls can be used to manipulate how the interactive tree is displayed: to show/hide the summary boxes; to highlight species that are represented in the seed alignment; to expand/collapse the tree to a given depth; to select a subtree or a set of species within the tree, view them graphically or as an alignment, and download the sequence; and to save a plain-text representation of the tree. Users can select species of interest, as shown in the figure, and employ the tool controls accordingly.

Figure 9

Example alignment display view of sequences from selected species. The alignment of the human (Homo sapiens), dog (Canis familiaris) and worm (Caenorhabditis elegans) sequences of the Bestrophin family (Pfam accession PF01062) chosen as in Figure 8 and displayed by using the tool controls on the right-hand side of the Tree view from the Species tab on the Family page. Three of the residues known from the literature to be mutated in bestrophin in the disease Best macular dystrophy are boxed, and arrows indicate which ones are the human, dog and worm sequences.

Conclusions

Pfam is a database of conserved protein families that is widely used throughout the biological community for providing insights into domain composition and thereby informing on protein function. In this article we have briefly discussed how Pfam data are generated and presented to the user, detailing the most important features of the website and suggesting other ways in which the user can access the Pfam data. Finally, we have included two examples to illustrate how Pfam might be used to help answer biological questions.

Availability

The Pfam websites are hosted from the Wellcome Trust Sanger Institute (WTSI) (http://pfam.sanger.ac.uk/), Stockholm Bioinformatics Center (http://pfam.sbc.su.se/) and Janelia Farm (http://pfam.janelia.org/). Pfam data can be downloaded directly from the WTSI ftp site (ftp://ftp.sanger.ac.uk/pub/databases/Pfam), either as flat files or in the form of MySQL table dumps.

Authors: Penny Coggill, Jaina Mistry, John Tate, Alex Bateman & Robert D. Finn

1. Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK. * Address correspondence to: pcc@sanger.ac.uk
