Bioinformatics Tutorial

In many ways, we live in an era of information, with what sometimes seems like too much information. Knowing how to
search for relevant information and how to separate the wheat from the chaff is a valuable skill for biologists. A new and
rapidly growing subdiscipline in biology is the field termed bioinformatics. It is possible to specialize in this area, earning
graduate degrees, but a working knowledge of the tools of bioinformatics is essential to all types of biologists, even
first-year biology students.
The National Center for Biotechnology Information (NCBI) has created a data retrieval system called Entrez that provides access to a
wide range of databases that contain information such as nucleotide sequences, protein sequences, protein structures, and
published literature. Once a particular record is retrieved, links make it possible to easily find associated records from other
databases, so that the whole system is integrated.
This lab exercise walks you through an Entrez-based search to find relevant information about hemoglobin and the mutated
version of hemoglobin that causes sickle cell anemia.
The directions, including the specific links, are found on this handout. A worksheet is provided as a separate document. The
answers to the questions in the exercise should be recorded on the worksheet and handed in before you leave lab today.
There are many types of information that can be obtained and analyzed using bioinformatics. In this lab, you will learn to do
a few of the most common things that biologists do using bioinformatics:

Searching the primary literature using PubMed to obtain references related to your gene or protein of interest
Retrieving DNA sequences for your gene or protein
Comparing the sequence of your gene against a database to find related sequences
Retrieving protein sequences
Finding and analyzing 3D structures for your protein

Part 1. Getting Started
Go to the following location, which is the starting point for doing an initial Entrez search across all of its databases. You
might want to bookmark this site (referred to in this handout as the Entrez front page) since we will be returning to it several
times. Type the term hemoglobin in the search box and click Go to get your results.
Part 2. Literature databases
The first set of results is from databases containing references to hemoglobin in the primary literature, that is, published
journal articles. The database PubMed contains citations from all major biomedical journals along with the published
abstracts. How many references for hemoglobin are contained in PubMed? Record this number on your worksheet.
Now click on PubMed to see what some of these references actually are. They are arranged in reverse chronological order of
publication, so the most recent publications are on top. Examine the first twenty shown. Record the names of the authors
for the first twenty publications.
Examine one of the references by clicking on the authors' names. This should lead you to the abstract for the article and may
provide a link to the article, but note that you may need a subscription to that journal to read the full text.
One thing you would probably like to do is to refine your search so as to find fewer and more obviously relevant
references. Go back to the Entrez front page and repeat this search, changing your search terms to try to make them more
specific. Try, for example, beta hemoglobin, and human beta hemoglobin. How do these terms change the number of
PubMed references? Record your answer on your worksheet.
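The same refinement can also be scripted. The sketch below is an illustration only: it assumes NCBI's E-utilities esearch service, which this handout does not otherwise use, and the helper name pubmed_search_url is hypothetical. It builds the query addresses for the three search terms without contacting the server.

```python
from urllib.parse import urlencode

# Address of the NCBI E-utilities esearch service (an assumption; the handout
# itself works through the Entrez web page, not this programmatic interface).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Return an esearch URL that asks PubMed for records matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# The three increasingly specific search terms suggested in the text.
for term in ["hemoglobin", "beta hemoglobin", "human beta hemoglobin"]:
    print(pubmed_search_url(term))
```

Opening one of these addresses in a browser returns an XML document whose Count element reports the number of matching references, which should shrink as the search term becomes more specific.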

There are also references to material on the search word from books, such as textbooks and reference books. After
searching Entrez with human beta hemoglobin, find the entry under books regarding sickle cell anemia in the book called
Genes and Disease. Read the entry, which summarizes the molecular basis of the disease. On your worksheet, briefly
describe one of the treatments for SCA that is described in this book entry.
Go back to the Entrez search results for human beta hemoglobin. Notice that under the PubMed link is the PubMed Central
link. This is a good place to find free, full-text journal articles. Many of the articles that you find in the regular PubMed site
are not available online, at least for free. However, our library can order nearly any article that you need through our
interlibrary loan (ILL) service, so you should not feel limited to the free articles in PubMed Central. All the information about the
article that you will need for the ILL request is available in the PubMed reference: title, authors, journal, volume, etc. Click
on one of the free articles just to confirm that you can get the entire paper online and to see the format. The PDF format
usually gives you the best view of the figures in the paper.
Part 3. Nucleotide sequence
When genes are sequenced, scientists submit the sequences to databases that allow other scientists to see and analyze those
sequences. GenBank is historically the place to deposit these sequences (from all organisms, not just humans), so it
contains nearly every sequence ever published. This means that it often contains multiple entries for the same gene. RefSeq
is a database of sequences that is edited by NCBI to be non-redundant, containing what appears to be the single best-supported
sequence for each gene. This often makes it the better place to try to retrieve a sequence. The Entrez nucleotide databases
collectively contain over 90 billion bases at this point, and the number grows exponentially.
Go back to the Entrez front page. Enter the term human beta hemoglobin and click Go. Record the number of hits for
CoreNucleotide sequences on your worksheet. Then click on the nucleotide link to see what types of hits this search
produced. Do they all sound like what you are really looking for? Record the name of the first gene sequence that shows up
on your worksheet. Then scan down the list- are there any gene names on the first page that seem a bit different from what
you are looking for?
Then click on the RefSeq tab just above the actual records. Record the number of RefSeq sequences. Remember that
RefSeq is the edited database so that sequences in this database represent the best accepted sequence.
On this first page, you should find a hit for human beta hemoglobin (HBB), obtained from a sequenced mRNA. Every entry
has a unique ID number. Record the number for this HBB sequence on your worksheet. Then click on this link.
Every GenBank or RefSeq reference contains a lot of information. Two important sections are the Reference section, which
lists journal articles that are relevant to the sequence determination, and the Sequence section, which provides the actual
nucleotide sequence. How many references for this entry are provided? Record this number on your worksheet. Notice
that there is a linked number for each reference. Click on one. Where does it take you? It should look familiar from the last
section you completed: you should see a PubMed reference. This demonstrates one of the best features of Entrez, which is that
records are cross-referenced.
Now scroll down to the actual sequence for this gene. Since it comes from an mRNA, rather than an actual genomic
sequence, it contains none of the introns. Sequences from mRNAs may contain 5' and 3' UTRs, however. Record the first 5
nucleotides on your worksheet.
Part 4. Finding related DNA sequences
Often it is useful to compare your sequence to other DNA sequences. Entrez provides a tool to do so, called BLAST, which
stands for Basic Local Alignment Search Tool. BLAST compares sequences (either nucleotide or protein sequences) and tells
you which other sequences are similar to yours.
Go to the RefSeq entry for the human beta hemoglobin mRNA. If you need to find it again, you can search Entrez with the
ID number you've previously recorded.
In order to use BLAST, we need to get the sequence in the correct format. The default input for BLAST is the format called
FASTA. A FASTA record always has the same layout: a first line starting with >, called the definition line, followed by
one or more lines containing the actual DNA sequence.
From the RefSeq entry, under Display at the top left, where it should say GenBank at this point, choose FASTA. This should
change the display to the FASTA format. Check to see that it follows the description above with a definition line and then
the sequence.
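To make the FASTA layout concrete, here is a minimal sketch of a parser that separates the definition line from the sequence lines. The function name parse_fasta and the example record are made up for illustration; the record is not the HBB sequence.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (definition line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record for illustration only; this is NOT the HBB sequence.
example = """>example_id hypothetical test sequence
ACGTACGTAC
GTTGCA
"""
records = parse_fasta(example)
print(records)
```

A check like this is a quick way to confirm that what you copied really begins with a definition line and contains only sequence after it.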
Select and copy the entire FASTA sequence. You could paste this directly into the BLAST window, but it is good practice
to save sequences so that you can easily find them again. Word documents make a good place to save FASTA sequences. To
practice this, open a new Word document and paste the sequence into the Word document. You will need to convert it to
Courier 10 point font and save it to the desktop, naming the file HBB mRNA.doc.
Then return to the Entrez front page. Click on the BLAST link in the upper right-hand corner. We are going to do a particular
type of nucleotide BLAST called blastn that compares your sequence to other nucleotide sequences. Choose nucleotide
BLAST from the list of BLAST options. Then paste your FASTA sequence into the SEARCH box. Then, choose other as
the database to search, and be sure that the window then says nucleotide collection (nr/nt). This will search nucleotide
sequences from all organisms. Then under Program Selection, choose Somewhat similar sequences (blastn). Now click
the BLAST button and await your results.
The BLAST display of the results contains several different sections. The top is the graphic display. This shows you
visually how similar your sequence is to other sequences. The top line is your sequence, which is referred to in BLAST as
the query. Each bar then represents another sequence that has similarity to your sequence; the length of the bar indicates the
span of the sequence that has similarity. The color of the bar indicates the degree of similarity. Record the color of the bars
that show up in your graphic display and indicate how strong the similarity is based on this color.
Below the graphic display is the hit list. Each line in the list contains the sequence ID (or accession) number and name,
which is a link to the GenBank entry for this gene. Remember that the top line is your query, so its ID number should look
familiar. Then there is a brief annotated description, which can give you some idea of what the sequence is, then a bit score
and then an E-value. The bit score measures the statistical significance of the alignment. A higher score means greater
similarity. The E-value is the expectation value. This is an estimate of how often you might expect to find that degree of
similarity just by chance. Therefore, the lower the E-value, the more confident you can be that your sequence has real
similarity to the other sequence. Generally speaking, an E-value greater than 10^-4 means the two sequences are not really
related. Very often scientists use an even stricter cutoff, perhaps 10^-10 or 10^-12. For the best of your matches, the E-value
is 0.0, which means there is essentially zero chance that the similarity is due to chance. Keep scrolling down and you will start to
see some very small nonzero numbers, which means the sequences are highly similar but there is at least a small statistical
chance that they align by chance.
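The link between bit score and E-value described above can be sketched numerically. As background that is not part of this handout, a commonly quoted approximation is E = m x n x 2^(-S), where S is the bit score, m is the query length, and n is the total length of the database. BLAST itself applies corrected "effective" lengths, so the numbers below will not match a real report exactly, and the lengths used here are hypothetical.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value: chance alignments expected at this bit score.

    Uses E = m * n * 2**(-S) with raw lengths, which is a simplification
    of what BLAST actually reports.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes chosen for illustration only.
QUERY_LEN = 600            # nucleotides in the query
DB_LEN = 1_000_000_000     # nucleotides in the database

for bits in (20, 40, 80, 160):
    e_value = expected_hits(bits, QUERY_LEN, DB_LEN)
    print(f"bit score {bits:>3}: E-value about {e_value:.3g}")
```

The point of the sketch is the trend: every additional bit halves the expected number of chance matches, which is why high bit scores go together with tiny E-values.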
From the hit list, record on your sheet the ID number, the description, the bit score, and the E-value of the top hit (the one
that is not your query, so the second sequence on the hit list.) What species is this?
Many hits also show some little colored boxes next to the E-value. These are links to other databases that contain
information regarding the sequence. We will not follow those links today, but they again illustrate how well integrated
Entrez is.
The next part of the BLAST results, located below the hit list, contains the alignments. This shows the actual nucleotide
alignment for each of the hits. In addition to the bit score and E-value, the alignment gives the percent identity (the percentage of
aligned positions where the nucleotides are the same) as well as the number of gaps. Then the actual alignment is shown. The top sequence is your
query, and the bottom is the hit, or subject. A vertical line between the two indicates an identity; no line indicates a mismatch or a gap.
The numbers on the side indicate the coordinates, that is, which nucleotide numbers in your query sequence and the hit sequence are
aligning. These numbers do not necessarily refer to the actual beginnings and ends of the gene, just to the start and stop of the
sequence that was submitted.
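The quantities in the alignment view can be reproduced by hand on a small example. This is a minimal sketch, assuming the two aligned strings have equal length and use "-" for gaps; the function names and the toy sequences are made up for illustration.

```python
def match_line(query, subject):
    """Build the middle line: '|' for an identity, a space otherwise."""
    return "".join(
        "|" if q == s and q != "-" else " " for q, s in zip(query, subject)
    )


def alignment_stats(query, subject):
    """Return (percent identity, percent gaps) over the alignment length."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    length = len(query)
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return 100 * identities / length, 100 * gaps / length


# Toy alignment, not real hemoglobin data.
query = "ACGT-ACGTA"
subject = "ACGTTACCTA"
print("Query  ", query)
print("       ", match_line(query, subject))
print("Sbjct  ", subject)
print(alignment_stats(query, subject))
```

Here 8 of the 10 aligned columns are identities and 1 column contains a gap, so the sketch reports 80 percent identity and 10 percent gaps.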
From the hit list, scroll down the hits until you see the match for rat (Rattus norvegicus). Then click on the underlined
number in the column labeled Max Score. This will take you directly down to the alignment for that sequence. Observe the
bit score, E-value, and percent identity. Record on your worksheet the percent identity for this hit, and percent of gaps.
Also record the coordinates of the query and subject sequences for the first line of the alignment.
We happen to have chosen an example that has a number of very highly related sequences in the databases. More frequently,
your BLAST searches will find sequences that are not nearly as well aligned and have higher E-values, and those alignments
show far more gaps.
Part 5. Aligning two sequences.
BLAST will also allow you to attempt to align two DNA sequences yourself. We are going to align the normal sequence for
hemoglobin with the mutated sequence that produces sickle cell anemia. First we need to find that sequence.
Go back to Entrez and search for hemoglobin S, which is the mutant form of beta hemoglobin found in individuals with
sickle cell anemia. This time, click on the tab for mRNAs (there should be only 1). Click on the link, and record the ID
number for this entry on your worksheet.
Then go to the BLAST page. The alignment is under Specialized BLAST types where it will say Align 2 sequences (bl2seq).
This type of BLAST allows you to enter the ID numbers for the two sequences, or to paste FASTA sequences into the boxes. You
could find and save the FASTA sequence for the sickle cell sequence into a Word document, as you did for the normal
hemoglobin earlier, but this time we'll use the accession numbers instead. Enter the accession numbers from your
worksheet for the two sequences (HBB first and then the hemoglobin S sequence). Be sure to do this carefully, including
capitalization and underscores, or by cutting and pasting. (The space in the HBB accession number should be entered as an underscore:
NM_000518). Then click on Align.
The alignment is shown just as in the blastn results. Find the places where the sequences do not show identities and record
the number and position of these non-identities on the worksheet.
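Scanning an alignment by eye for the few positions that differ is error-prone, so the same check can be sketched in code. This assumes two sequences that are already aligned without gaps; the function name and the toy sequences are made up and are not the hemoglobin sequences.

```python
def mismatch_positions(seq_a, seq_b):
    """List (position, base in seq_a, base in seq_b) for each difference.

    Positions are 1-based, matching the coordinates BLAST displays.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# Toy sequences for illustration only.
normal = "ATGCCGTA"
mutant = "ATGCTGTA"
differences = mismatch_positions(normal, mutant)
print(len(differences), "difference(s):", differences)
```

In this toy case the sketch finds a single difference at position 5, where a C is replaced by a T.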
Part 6. Retrieving protein sequences
Return to the Entrez search results for human beta hemoglobin. This time, look at the protein results. How many hits are
there? Record this number on the worksheet. Click on the protein results and then click on the RefSeq button (like
nucleotides, this is the edited list). Record the accession number for the hemoglobin protein sequence on your worksheet.
Click on the hit to open it. The format for the protein entries in GenPept is quite similar to GenBank. Scroll to the bottom
and observe the sequence. Amino acids are represented by a single letter code. Record on your worksheet the first 10
amino acids of the protein.
Now, search the protein database for hemoglobin S. Scroll down the list of hits until you see hemoglobin beta chain variant
Hb-S Wake Homo sapiens. Record the accession number for this hit on your worksheet. Then click on the hit and examine
its sequence by scrolling to the bottom. Record the first 10 amino acids on your worksheet.
Remember that all proteins start with methionine (M) because AUG functions as the start codon for all mRNAs, but
hemoglobin, like many proteins, has the methionine removed post-translationally. Assuming this happens, at what amino
acid number is the mutation in the sickle cell hemoglobin found? What letter should it be, and what letter is it in the
mutation? Record this information on your worksheet.
Is this the same information that your handout for the wet lab portion of today's lab explains? What impact does this have on
the protein?
Part 7. Protein structures.
Return to the Entrez site and click on the Structure link. The first thing we will need to do is to download NCBI's new 3D
structure viewer called Cn3D.
Click on the underlined Cn3D, and then click on the Download link. Click on the PC Macintosh link and then download the
software. Then hide your browser, and open the downloaded folder. Click on the icon to start the software.
Then return to your browser and return to the Structure link from Entrez. Search the Structure database using the keyword
sickle cell. Find the hit for The High Resolution Crystal Structure of Deoxyhemoglobin S.
Then click on the 3D image to open it. You will probably be asked which application to choose to open it with. Just click on
OK and you should see a 3D color image of the protein appear. The fun part is being able to rotate the molecule. Can you
identify alpha helices and beta sheets within the structure? Can you find the binding sites for iron? Record a brief
description of the secondary structures present in the protein on your worksheet.
Only proteins whose crystal structures have been solved are represented here, so for many proteins you
may not find the exact protein that you are interested in. But you may be able to find related proteins that will help you
understand key elements in the structure of your protein.
Part 8. Finishing up.
When you look at protein structures, the downloaded file is placed on your desktop. Hide open applications, and drag any of
these files to the trashcan on the dock. Also drag the Word document with your FASTA sequence to the trash. Then quit all
open applications. Then answer the two summary questions on your worksheet, and hand it in.

Bioinformatics Worksheet
Part 2. Literature searching
1. # of hemoglobin references in PubMed


2. Article authors

3. # of beta hemoglobin references in PubMed


4. # of human beta hemoglobin references in PubMed


5. Brief description of treatment for SCA from book entry:

Part 3. Nucleotide sequences
1. # of nucleotide hits for human beta hemoglobin


2. The first sequence on the list is:

3b. # of RefSeq hits for human beta hemoglobin


4. ID # for HBB mRNA sequence


5. # of PubMed references provided for this sequence


6. First 5 bases of sequence


Part 4. BLASTing sequences to look for similarities

1. The color of the bars in the graphic display


2. This indicates (high, medium, low) level of similarity

3. My top hit information:


4. The rat beta globin information:

Percent identity


Percent gaps


5. Coordinates of the first line (first nucleotide-last nucleotide)


Part 5. Aligning two sequences

1. ID # for hemoglobin S mRNA


2. # of non identical nucleotides

3. Positions of these non-identities:


Part 6. Retrieving protein sequences.

1. # of hits for human beta hemoglobin


2. ID# for beta hemoglobin protein


3. First 10 amino acids in single letter code:

4. ID# for hemoglobin S protein


5. First 10 amino acids in single letter code:

6. Amino acid number at which the mutation occurs


7. Normal amino acid at that position


8. Mutated amino acid at that position


Part 7. Viewing 3D structures of proteins

1. Brief description of the secondary structures in sickle cell hemoglobin
Part 8. Summary questions.
1. In your own words, describe the usefulness of the Entrez site for a biologist. Give a specific example of a task
that a biologist could accomplish using Entrez.

2. What information about beta hemoglobin and/or sickle cell hemoglobin did you obtain from this exercise?