Building Your Own Gene Machine

With Unix/Linux
Robert A. Cramer Jr., Ph.D.
Department of Veterinary Molecular Biology
Montana State University
Seminar Purpose
YOU .. CAN .. DO . IT!!!
Shhhhhhh .. AND YOU SHOULD!
Oh My %*&# NOT THE COMMAND LINE!
Why?
If you work in biology and use
molecular/genomics tools . You Have To!
(One way or another)
Independence .. Do it yourself!
Convenience anytime, anywhere
GUIs on the internet have their limitations
But you probably already know that
Fun?
My Story .. I'm not a
bioinformatician .. But
5,000 ESTs from a mixed-infection library ..
What to do?
I wanted to graduate before 2020, so analyzing
one sequence at a time was not going to cut it
.. !
No cluster informatic resources available to me
more or less on my own .
Hello Command Line Hello UNIX .. Hello MAC
Building Your Gene Machine
Step 1: Become Familiar with Unix
Commands (Or Linux if you prefer PCs)
Intimidating part for most .. But it is painless .
Really . Okay, maybe just a bit . :-)
Step 2: Install Basic Informatics Software
Most Scientists Try and Start Here then Proceed
to 1 :-)
Step 3: Trial and Error Yes, can I have
some more CT Drill Sergeant? Well, yes, you
must!
Unix (On the almighty Mac)
OS X is a flavor of Unix - So is Linux
Windows is DOS based .. Ugh.
MAC gives you best of both worlds.
Terminal - direct link to the computer - you are the
boss! Under - /Applications/Utilities on MAC
X11 on Macs - can install from the Developer Tools disc
that comes with all Macs. (I encourage you to install it: not
all open source software comes in binary form, and the disc includes
the latest gcc compiler.) Installs in Applications. Allows you to run
graphical X programs (like PHYLIP or CLUSTALX).
Linux on the PC - Many flavors, RedHat Fedora is Free
--- I learned on PC running RedHat Linux
Unix Basics
The SHELL - command interpreter
BASH most popular, followed by csh or tcsh; I use
tcsh, why? I learned it first.
Hierarchical system**
Directories (like folders on Windows or Mac)
Sub-Directories
Files
KNOW WHERE YOU ARE!!!! Key Unix Concept
Unix Commands all lowercase - Unix is case sensitive
Unix Command: pwd - show current working directory
Unix Command: cd - change directory
When you start-up terminal you are in your HOME directory
Unix Command: ls - lists what's in the current directory
Unix Commands - Easy to find, just use the google
http://www.cs.drexel.edu/~kschmidt/Ref/unix_reference.html
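A quick way to get a feel for pwd, cd and ls is to practice in a scratch directory. The sketch below is only an illustration: the directory names (genemachine_demo, blast, emboss) are made up, and it is written for sh/bash (the three commands themselves behave the same in tcsh).

```shell
#!/bin/sh
# Sketch only: directory names here are hypothetical, created in a scratch area.
demo_root=$(mktemp -d)            # scratch area so nothing real is touched
mkdir -p "$demo_root/genemachine_demo/blast" "$demo_root/genemachine_demo/emboss"

cd "$demo_root/genemachine_demo"  # change directory (absolute path)
pwd                               # show where you are now
ls                                # list what is here

cd blast                          # relative path: go down into a sub-directory
here=$(pwd)                       # the output of pwd can be captured like any command
echo "Now in: $here"

cd ..                             # .. means "one level up"
listing=$(ls)
echo "$listing"

cd                                # with no argument, cd takes you back HOME
rm -rf "$demo_root"               # clean up the scratch area
```

If you ever feel lost, type pwd. It costs nothing.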
THE COMMAND
man command
Will bring up manual for any Unix
command telling you how to use it and
what it is used for
Wow, how user friendly!
The Biggest Mistake .
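Most common mistake beginning Unix users make is not
understanding the concept of working directories and PATH
To run a program by name alone, the shell has to know where it lives.
Otherwise you MUST be in the directory where the program is installed
(and type ./program), or type its full path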
Computers are STUPID!!!! You MUST tell them everything (with
no syntax errors).
UNLESS . You set your PATH
PATH is set in a login file and tells the stupid computer where to
look when you run commands
.tcshrc, .cshrc, etc. etc.
Editors .. Can edit your login file or any file for that matter, I use
vi or pico
Editors have their own sets of commands .. again, GOOGLE! Or
buy a book!
Path
From: http://www.dartmouth.edu/~rc/classes/unix1/print_pages.shtml
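To see PATH in action, here is a minimal sketch. It assumes sh/bash syntax (tcsh users would write set path = ( dir $path ) instead), and "hello_gene" is a made-up stand-in for a real program such as blastall.

```shell
#!/bin/sh
# Sketch only: "hello_gene" and the scratch directory are hypothetical stand-ins
# for a real program and its installation directory.
tool_dir=$(mktemp -d)

# Create a tiny executable "program" inside the scratch directory.
cat > "$tool_dir/hello_gene" <<'EOF'
#!/bin/sh
echo "hello from the gene machine"
EOF
chmod +x "$tool_dir/hello_gene"

# 1. Not on PATH yet: the shell cannot find the program by name.
if command -v hello_gene >/dev/null 2>&1; then
    before="found"
else
    before="not found"
fi
echo "Before editing PATH: $before"

# 2. A full path always works, wherever you are.
"$tool_dir/hello_gene"

# 3. Put the directory on PATH; now the bare name works from anywhere.
PATH="$tool_dir:$PATH"
export PATH
after=$(hello_gene)
echo "After editing PATH: $after"

rm -rf "$tool_dir"                # clean up the scratch area
```

A change made this way lasts only for the current terminal session, which is why the permanent version goes in your login file.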
Second Biggest Mistake
Directory and File Permissions!
Unix is very secure, but you have to be aware of your
permissions when installing software and writing files to
directories
ROOT user always has permission
So many software installs are done as ROOT
If you try and install a program, or make a new directory and
an error comes back telling you that you do not have
permission, you know why!
Permissions
Modified From: Kschmidt, Drexel
chmod command can modify permissions
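A small sketch of chmod at work, assuming sh/bash; the script name run_analysis.sh is hypothetical. It shows a file that cannot be executed until the owner is given execute permission.

```shell
#!/bin/sh
# Sketch only: run_analysis.sh is a made-up script name in a scratch directory.
work_dir=$(mktemp -d)
script="$work_dir/run_analysis.sh"
printf '#!/bin/sh\necho "analysis ran"\n' > "$script"

chmod 644 "$script"          # rw-r--r-- : readable, but not executable
ls -l "$script"              # the first column shows the permission string
if [ -x "$script" ]; then before="executable"; else before="not executable"; fi

chmod u+x "$script"          # add execute permission for the owner (u = user)
ls -l "$script"
if [ -x "$script" ]; then after="executable"; else after="not executable"; fi

result=$("$script")          # now it can be run directly
echo "$before -> $after: $result"
rm -rf "$work_dir"           # clean up the scratch area
```

The same idea applies to directories: if you cannot write into one, check its permissions with ls -ld before reaching for root.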
Third .
File formats .. really, a lot of bioinformatics
is manipulating sequence files into correct
formats.
Common Complaint: Student to Instructor:
"I keep trying to run my protein sequence in a
local blast but it does not work. I don't know
why. I got my sequence from NCBI and cut
and pasted it into Microsoft Word, saved it, and
now BLAST does not work."
Files ..
>YDR044W Chr 4
MPAPQDPRNLPIRQQMEALIRRKQAEITQGLESIDTVKFHADTWTRGNDGGGGTSMVIQD
GTTFEKGGVNVSVVYGQLSPAAVSAMKADHKNLRLPEDPKTGLPVTDGVKFFACGLSMVI
HPVNPHAPTTHLNYRYFETWNQDGTPQTWWFGGGADLTPSYLYEEDGQLFHQLHKDALDK
HDTALYPRFKKWCDEYFYITHRKETRGIGGIFFDDYDERDPQEILKMVEDCFDAFLPSYL
TIVKRRKDMPYTKEEQQWQAIRRGRYVEFNLIYDRGTQFGLRTPGSRVESILMSLPEHAS
WLYNHHPAPGSREAKLLEVTTKPREWVK*
Text File From A Text Editor: This is GOOD
??^Q ^Z?^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>^@^C^@??
^@^F^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@%^@^@^@^@^@^@^@^@^P^@^@'^@^@^@^A^@^@^@????^@^@^@
^^
@$^@^@^@??????????????????????????????????????????????????????????????????????????????????????????????
????????????????????????????????????????????????????????????????????????????????????????????????????????
????????????????????????????????????????????????????????????????????????????????????????????????????????
????????????????????????????????????????????????????????????????????????????????????????????????????????
????????????????????????????^@~Gb
^D^@^@?^R?^@^@^@^@^@^A^Q^@^A^@^A^@^F^@^@b^G^@^@^N^@jbjb^B?^B?^@^@^@^@^@^@^@^@^@^@^@^@^
@^@^@^@^@^@
..^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>YDR044W
Chr 4
^MMPAPQDPRNLPIRQQMEALIRRKQAEITQGLESIDTVKFHADTWTRGNDGGGGTSMVIQD^MGTTFEKGGVNVSVVYGQLSPAA
VSAMKADHKNLRLPEDPKTGLPVTDGVKFFACGLSMVI^MHPVNPHAPTTHLNYRYFETWNQDGTPQTWWFGGGADLTPSYLYEE
DGQLFHQLHKDALDK^MHDTALYPRFKKWCDEYFYITHRKETRGIGGIFFDDYDERDPQEILKMVEDCFDAFLPSYL^MTIVKRRKDM
PYTKEEQQWQAIRRGRYVEFNLIYDRGTQFGLRTPGSRVESILMSLPEHAS^MWLYNHHPAPGSREAKLLEVTTKPREWVK*^M^M^
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
Same file from Word: Gee, wonder why this does not work?
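The ^M characters in the Word dump are carriage returns, and they are easy to test for. The sketch below (sh/bash, hypothetical file names, a shortened copy of the sequence) builds one clean FASTA file and one with carriage-return line endings, counts the offending bytes, and repairs the file with tr.

```shell
#!/bin/sh
# Sketch only: file names are hypothetical and the sequence is shortened.
work_dir=$(mktemp -d)
good="$work_dir/good.fasta"
bad="$work_dir/bad.fasta"
fixed="$work_dir/fixed.fasta"

printf '>YDR044W Chr 4\nMPAPQDPRNLPIRQQ\nGTTFEKGGVNVSVVY\n' > "$good"
printf '>YDR044W Chr 4\rMPAPQDPRNLPIRQQ\rGTTFEKGGVNVSVVY\r' > "$bad"

# Count carriage-return bytes: a plain Unix text file should have none.
count_cr() { tr -cd '\r' < "$1" | wc -c | tr -d ' '; }
good_cr=$(count_cr "$good")
bad_cr=$(count_cr "$bad")
echo "carriage returns: good=$good_cr bad=$bad_cr"

# Repair: turn each carriage return into a Unix newline.
tr '\r' '\n' < "$bad" > "$fixed"

# A FASTA file should start with '>' and have one header line per sequence.
first_char=$(head -c 1 "$fixed")
n_seqs=$(grep -c '^>' "$fixed")
echo "first character: $first_char, sequences: $n_seqs"

if cmp -s "$good" "$fixed"; then same="yes"; else same="no"; fi
echo "fixed file identical to good file: $same"
rm -rf "$work_dir"           # clean up the scratch area
```

This only handles carriage-return line endings. For Windows-style files (carriage return plus newline) use tr -d '\r' instead, and for a real Word document the only real fix is to save the sequence as plain text in the first place.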
Okay Already . Your Gene Machine
This is just an intro! The software you can
install on your own personal gene machine is
virtually limitless these days .. install what you
need.
These are some of the basic essentials that I
use routinely to analyze genomic sequence data
- BLAST - NCBI or Wash U
- EMBOSS - Must have
- HMMER - profile Hidden Markov Models for finding protein domains and families
- Prosite - Patterns and Profiles from proteins
- Fink - Incredible resource for Mac users (another
reason to use a Mac if you do a lot of informatics)
Installing Local BLAST
NCBI FTP site - on NCBI home page
Download appropriate version for your flavor
of Unix!
Know where you install it
Completely up to you!
Some people install all programs (executables) in the
directory /usr/local/bin
Some people install programs in their own respective
directories, e.g. /Users/rcramer/BLAST
Regardless, you should make sure your
installation directory is in YOUR PATH
Now the Installation
Unpack the file in your favorite directory!
*you may need to do this as root user if you get
an error saying you do not have permission
rcramer% mkdir -p /usr/local/bin (or: sudo mkdir -p /usr/local/bin as root)
rcramer% mv /Users/rcramer/Desktop/blastetc.tar.gz /usr/local/bin
rcramer% cd /usr/local/bin
rcramer% gunzip -c blastetc.tar.gz | tar xf -
Follow the UNIX install and testing of the installation instructions in the
README.bls file
You'll know it's working if you type:
rcramer% blastall
And get a list of various options
Don't forget to set your path in your .cshrc file!
(use the directory where you actually unpacked BLAST)
vi .cshrc
set path = ( /Users/rcramer/blast/blastetc/bin $path )
Step 2- BLAST Databases
The power of local BLAST is you can install multiple genome
databases or any type of sequence database that you use
routinely!
Databases can be obtained at NCBI or your
favorite organism's genome homepage
Usually in FASTA format
Use the formatdb command to format your database
Make sure you format it correctly, protein or
nucleotide!
formatdb -i afu_peptides.seq -p T -o T
Advantages of Local Blast
Can make your own BLAST databases
Can run batch blast, i.e. many sequences at
the same time, and not compete with others
on the internet server
Can do BLAST searches wherever, whenever,
regardless of whether you have internet
access
Control --- can control the output, many many
options!!! (Important for downstream
analyses)
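A batch search is just a query file with many FASTA records in it. The sketch below assumes sh/bash and the legacy NCBI blastall program; the file names and the database name my_db are hypothetical, and the search step is skipped if blastall or a formatted protein database is not present.

```shell
#!/bin/sh
# Sketch only: queries.fasta, my_db and batch_hits.tsv are hypothetical names.
work_dir=$(mktemp -d)
queries="$work_dir/queries.fasta"

# A multi-FASTA file: several query sequences in one file.
cat > "$queries" <<'EOF'
>query1
MPAPQDPRNLPIRQQMEALIRRKQAEITQGLESIDTVKFHADTW
>query2
GTTFEKGGVNVSVVYGQLSPAAVSAMKADHKNLRLPEDPKTGLP
>query3
HPVNPHAPTTHLNYRYFETWNQDGTPQTWWFGGGADLTPSYLYEE
EOF

n_queries=$(grep -c '^>' "$queries")
echo "Queries in batch: $n_queries"

# -p program, -d database, -i input, -o output,
# -e E-value cutoff, -m 8 tabular output (easy to parse downstream)
if command -v blastall >/dev/null 2>&1 && [ -e my_db.phr ]; then
    blastall -p blastp -d my_db -i "$queries" -o batch_hits.tsv -e 1e-5 -m 8
    blast_status="ran"
else
    blast_status="skipped"
    echo "blastall or database my_db not found; nothing was searched"
fi
rm -rf "$work_dir"           # clean up the scratch area
```

The tabular output option is the "control" part: one line per hit, which is what you want when the results feed into another program.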
EMBOSS
http://emboss.sourceforge.net/
Comprehensive sequence analysis tool-kit
Contains Hundreds of sequence analysis programs
All free!!
Can be run from command line, allows you to Script together
several programs at a time (real analysis power when you start
doing this)
Several GUIs are also available to download and install
Step 1: Acquire Latest Release
Step 2: Install According to Instructions
Remember your permissions (root), PATH
http://emboss.sourceforge.net/docs/adminguide/node8.html
Step 3: Test Run!
Example EMBOSS Install
Download EMBOSS-3.x.x.tar.gz
Create directory you want to install emboss in: *Do this as ROOT
rcramer # mkdir /Users/rcramer/emboss
rcramer # mv EMBOSS-3.x.x.tar.gz /Users/rcramer/emboss
rcramer # cd /Users/rcramer/emboss
rcramer # gunzip EMBOSS-3.x.x.tar.gz
rcramer # tar -xf EMBOSS-3.x.x.tar
This last step makes a NEW DIRECTORY EMBOSS-3.x.x
rcramer # cd /Users/rcramer/emboss/EMBOSS-3.x.x
rcramer # ./configure
** You need a gcc compiler installed!!!
rcramer # make
rcramer # make install
Make sure you SET your PATH in your .cshrc file!
e.g., for a 5.0.0 install: set path = ( /Users/rcramer/emboss/EMBOSS-5.0.0/emboss/ $path )
Some EMBOSS applications produce graphical output, so you need to set the PLPLOT_LIB
environment variable AND have an X windows interface (Mac users = X11)
In your .cshrc file: setenv PLPLOT_LIB /Users/rcramer/emboss/EMBOSS-5.0.0/plplot/lib
Wossname is your EMBOSS friend
Try running wossname
rcramer % wossname restrict
SEARCH FOR 'RESTRICT'
recoder    Remove restriction sites but maintain same translation
redata     Search REBASE for enzyme name, references, suppliers etc
remap      Display sequence with restriction sites, translation etc
restover   Find restriction enzymes producing specific overhang
restrict   Finds restriction enzyme cleavage sites
showseq    Display a sequence with features, translation etc
silent     Silent mutation restriction enzyme scan
Can you find a program to:
Display multiple alignments - Yes
Find ORFs (Open Reading Frames) - Yes
Translate a sequence - Yes
Find restriction enzyme sites - Yes
Find the isoelectric point of a protein - Yes
Do global alignments - Yes
Write your dissertation - No
EMBASSY
A group of programs that work like EMBOSS programs but are
packaged separately, so you need to install them separately:
HMMER, MEME, TOPO, PHYLIP, and more!
Detailed installation instructions for both
EMBOSS and EMBASSY:
http://emboss.sourceforge.net/docs/adminguide/admin.html
Your Gene Machine
If you install BLAST with your favorite databases ..
EMBOSS Package
EMBASSY Package
You've created a very powerful and useful personal
gene machine that you can use anywhere,
anytime!
Of course there is much more available. ClustalW,
Prosite, MUSCLE, PHRED, PHRAP, etc. etc.
What you put on your Gene Machine is up to you
Last - Maybe Most Important?
http://www.finkproject.org/
An absolute must to have installed if you are a Mac
user (and you should be if you do a lot of
informatics!)
Fink Packages
Remember ..
You have to engage the command line
You will fail, but the computer will usually tell you
what is wrong. So try again! (Don't forget about
the google)
PERMISSIONS
PATH
ENVIRONMENT
FILE FORMAT
Most of the time you will fail because one of
the above 4 is not right
Some Resources
Each program will have a manual; often just running
the program w/o any arguments will bring up all the
possible options and tell you the correct syntax
Google
Introduction to Unix:
Just google this, LOTS of webpages with basic
Unix commands, lectures etc.
MSU Bioinformatics Core Facility - Intro to Unix
Class, Computational Cluster, etc.
Books - lots of good intro to Unix books out there,
e.g. the O'Reilly series.
Let's take the Gene Machine for a test drive
Non-ribosomal Peptide
Synthetase Gene
New Sequenced Genome
How many NRPS does it
have?
Simple Right? Yes, but .
Multiple domains make
BLAST search inconclusive
But BLAST will narrow the
field
HMMER or PROSITE can
give definitive number by
examining domains
All done in a matter of
minutes while you watch
The Office
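The "narrow the field" step can be sketched with a few standard Unix tools. Everything below is an assumption for illustration: the hit table is made-up toy data in the 12-column tabular BLAST layout (blastall -m 8), the gene names are hypothetical, and the E-value cutoff of 1e-10 is arbitrary. Written for sh/bash.

```shell
#!/bin/sh
# Sketch only: toy hit table and hypothetical gene names, not real results.
work_dir=$(mktemp -d)
hits="$work_dir/nrps_hits.tsv"

# Columns: query, subject, %identity, aln length, mismatches, gap opens,
#          q.start, q.end, s.start, s.end, E-value, bit score
printf 'nrps_query\tgene_0101\t45.2\t510\t270\t9\t1\t505\t20\t529\t1e-120\t430\n'     > "$hits"
printf 'nrps_query\tgene_0101\t38.0\t480\t290\t8\t600\t1075\t700\t1179\t1e-80\t300\n' >> "$hits"
printf 'nrps_query\tgene_0472\t41.7\t495\t280\t7\t1\t490\t15\t509\t1e-95\t350\n'      >> "$hits"
printf 'nrps_query\tgene_0933\t25.1\t120\t88\t2\t10\t128\t40\t159\t0.5\t32\n'         >> "$hits"

# Keep hits with E-value <= 1e-10, then count each subject gene only once
# (a multi-domain gene can be hit several times by the same query).
candidates=$(awk -F '\t' '$11 + 0 <= 1e-10 { print $2 }' "$hits" | sort -u)
n_candidates=$(printf '%s\n' "$candidates" | grep -c .)

echo "Candidate genes to check with HMMER or PROSITE:"
echo "$candidates"
echo "Count: $n_candidates"
rm -rf "$work_dir"           # clean up the scratch area
```

This gives a shortlist, not the final answer: the domain-level check with HMMER or PROSITE is still what settles the number.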
Do I have to do
this with one
sequence at a
time? NO!!