DIVISION OF CLINICAL AND EXPERIMENTAL MICROBIOLOGY DEPARTMENT OF BIOMEDICAL SCIENCES

UNIVERSITY OF SASSARI

WHOLE GENOME ASSEMBLY AND ANALYSIS


SHORT REPORT
Supervisor MASSIMO DELIGIOS, PhD. Prof. PIERO CAPPUCINELLI

NGUYEN HOANG BACH, MSc.

Sassari, 2011

Part 01

Install Cygwin for Windows 7 OS
Install Velvet 1.0.19
Create contigs with Velvet

A. Install Cygwin with perl, C++ compiler, debugger, and make for Windows 7 OS

Cygwin is:

- a collection of tools which provide a Linux look-and-feel environment for Windows;
- a DLL (cygwin1.dll) which acts as a Linux API layer providing substantial Linux API functionality.

The Cygwin DLL currently works with all recent, commercially released x86 32-bit and 64-bit versions of Windows, with the exception of Windows CE (1).

(1) Windows CE (now officially known as Windows Embedded Compact, previously also known as Windows Embedded CE, and sometimes abbreviated WinCE) is an operating system developed by Microsoft for embedded systems. Windows CE is a distinct operating system and kernel, rather than a trimmed-down version of desktop Windows. It is not to be confused with Windows XP Embedded, which is NT-based.

The full Cygwin package list is available at: http://www.cygwin.com/packages/
The packages needed here are:
Package                   Description
gcc-g++                   GCC-3 Series legacy compiler: C++ compiler
gdb                       The GNU Debugger
make                      The GNU version of the 'make' utility
perl                      Larry Wall's Practical Extracting and Report Language
perl-Error                Perl module for OO error/exception handling
perl-ExtUtils-Depends     Build Perl XS that depend on other XS
perl-ExtUtils-PkgConfig   Perl module for using pkg-config
perl-Graphics-Magick      GraphicsMagick Perl bind (PerlMagick)
perl-Image-Magick         Image manipulation software suite (Perl bindings)
perl-Locale-gettext       Perl module for using gettext and libintl
perl-SGMLSpm              Perl SGMLS parser module
perl-Tk                   Perl interface for Tk (X11)
perl-Win32-GUI            Perl Win32-GUI module
perl-XML-Simple           Perl module for simple XML access
perl-libwin32             Perl extensions for using the Win32 API
perl-ming                 A SWF output library - (Perl bindings)
perl_manpages             Perl manpages


The make utility automatically determines which pieces of a large program need to be recompiled, and issues commands to recompile them. GNU make was implemented by Richard Stallman and Roland McGrath; development since version 3.76 has been handled by Paul D. Smith. GNU make conforms to section 6.2 of IEEE Standard 1003.2-1992 (POSIX.2). Make is most often used with C programs, but it works with any programming language whose compiler can be run with a shell command. Indeed, make is not limited to programs: it can describe any task where some files must be updated automatically from others whenever the others change.

B. Install Velvet 1.0.19 running on Windows 7 OS with Cygwin (including C++ compiler, debugger, and make)

What is Velvet?

Velvet is a de novo genomic assembler specially designed for short-read sequencing technologies, such as Solexa or 454, developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), United Kingdom. Velvet takes in short read sequences, removes errors, then produces high-quality unique contigs. It then uses paired-end read and long read information, when available, to resolve the repeated regions between contigs.

What are the memory requirements and running time of velvetg?

Both depend on the number and size of the reads to be assembled. The memory requirements can be estimated with the regression formula given below. The speed at which velvetg runs depends on many variables, including CPU type and speed, memory bus speed, size and number of reads, and the value of k, and so is difficult to estimate. For 30 million 36-mers with k = 29, the initial velvetg run finishes on a desktop with 16 GB RAM in 15 - 20 minutes, and subsequent runs are faster. A time allocation of 160 hours is therefore more than enough; the biggest concern is the memory requirement. The memory estimator is:

Ram required for velvetg (KB) = -109635 + 18977*ReadSize + 86326*GenomeSize + 233353*NumReads - 51092*K
Ram required for velvetg (GB) = Ram required for velvetg (KB) / 1048576

Where:
- ReadSize is in bases
- GenomeSize is in millions of bases (Mb)
- NumReads is in millions
- K is the k-mer hash value used in velveth

The results are accurate to within +/- 0.5 - 0.8 GB on the system used to fit the formula (64-bit Fedora 10, quad core, 16 GB RAM).

Example: for K = 31, NumReads = 50 million, ReadSize = 36 and GenomeSize = 5 Mb, the estimator returns ~10.5 GB of RAM required.

The regression equation should be fairly valid for the following ranges:
- K = 15 - 31
- NumReads = 5 - 70 million
- GenomeSize = 2 - 10 Mb
- ReadSize = 20 - 75 bases
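The memory estimator above is simple arithmetic, so it can be checked before launching a run. The following is a minimal Python sketch of that formula; the function names are illustrative and not part of Velvet, and the coefficients are copied from the estimator as stated in this report.

```python
def velvetg_ram_kb(read_size, genome_size_mb, num_reads_millions, k):
    """Estimated velvetg memory in KB, using the regression quoted in this report.

    read_size          -- read length in bases
    genome_size_mb     -- genome size in millions of bases
    num_reads_millions -- number of reads in millions
    k                  -- hash length passed to velveth
    """
    return (-109635
            + 18977 * read_size
            + 86326 * genome_size_mb
            + 233353 * num_reads_millions
            - 51092 * k)


def velvetg_ram_gb(read_size, genome_size_mb, num_reads_millions, k):
    """Same estimate converted from KB to GB (1 GB = 1048576 KB)."""
    return velvetg_ram_kb(read_size, genome_size_mb, num_reads_millions, k) / 1048576


if __name__ == "__main__":
    # Worked example from the text: K = 31, 50 million reads of 36 bases, 5 Mb genome.
    print(f"{velvetg_ram_gb(36, 5, 50, 31):.2f} GB")
```

Run with the values of the worked example, the sketch reproduces the quoted figure of about 10.5 GB to within rounding. Outside the validity ranges listed above the regression should not be trusted.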

C. Create the contigs from the sequence data with Velvet 1.0.19

Step1: compile Velvet
make
or
make MAXKMERLENGTH=57
The second form raises the maximum hash length from the default of 31 to 57.

Step2: combine the paired read files of the MT_HUE_20 whole-genome sequence data into one file

./shuffleSequences_fastq.pl ./data/s_5_1_FC70G0HAAXX_501827_44251_MTB_20_HUE.fastq ./data/s_5_2_FC70G0HAAXX_501827_44251_MTB_20_HUE.fastq fullseq.fastq

Syntax: ./shuffleSequences_filetype.pl ./[include_path/file1_name] ./[include_path/file2_name] ./[include_path/newfile_name]
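For readers who want to see what the shuffling step does, the sketch below interleaves two paired FASTQ files so that each read from the first file is immediately followed by its mate from the second. It is an illustration in Python, not the Perl script shipped with Velvet; it assumes plain four-line FASTQ records and that both files list the pairs in the same order.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as lists of four newline-terminated lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        # Normalise line endings so a missing final newline cannot merge records.
        yield [line.rstrip("\n") + "\n" for line in record]


def interleave(handle1, handle2, out):
    """Write mate 1 then mate 2 for every pair; return the number of pairs.

    Assumption: both inputs hold the same number of records in the same
    order. zip() stops at the shorter input, so extra reads are dropped.
    """
    pairs = 0
    for rec1, rec2 in zip(read_fastq(handle1), read_fastq(handle2)):
        out.writelines(rec1)
        out.writelines(rec2)
        pairs += 1
    return pairs


if __name__ == "__main__":
    # Tiny in-memory demonstration with made-up reads.
    forward = io.StringIO("@r1/1\nACGT\n+\nIIII\n")
    reverse = io.StringIO("@r1/2\nTTGG\n+\nIIII\n")
    merged = io.StringIO()
    print(interleave(forward, reverse, merged), "pair(s) written")
    print(merged.getvalue(), end="")
```

With real data the three `StringIO` objects would be replaced by files opened with `open(...)`, for example the two `s_5_*` files and `fullseq.fastq` named above.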

Step3:
./velveth

Step4:
./velvetg

Step5:
./velveth output_directory hash_length [-file_format] [-read_type] [filename]

output_directory
Velvet_dir/output_dir

hash_length
The hash length is the length of the k-mers entered in the hash table.
- It must be an odd number, to avoid palindromes. If we put in an even number, Velvet will just decrement it and proceed.
- It must be below or equal to the compiled maximum k-mer length (MAXKMERLENGTH, default 31 bp), because it is stored on 64 bits.
- It must be strictly less than the read length, otherwise we simply will not observe any overlaps between reads.

As is often the case, it's a trade-off between specificity and sensitivity. Longer k-mers bring more specificity (i.e. fewer spurious overlaps) but lower coverage (cf. below), so there's a sweet spot to be found with time and experience. It helps to think in terms of k-mer coverage, i.e. how many times a k-mer has been seen among the reads. The relation between k-mer coverage Ck and standard (nucleotide-wise) coverage C is

Ck = C * (L - k + 1) / L

where k is the hash length and L the read length. Experience shows that this k-mer coverage should be above 10 to start getting decent results. If Ck is above 20, we might be wasting coverage. Experience also shows that empirical tests with different values for k are not that costly to run!

[-file_format]
Supported formats are:
- fasta (default)
- fastq
- fasta.gz
- fastq.gz
- eland
- gerald

[-read_type]
Read categories are:
- short (default)
- shortPaired
- short2 (same as short, but for a separate insert-size library)
- shortPaired2 (see above)
- long (for Sanger, 454 or even reference sequences)
- longPaired

[filename]
Including path

Example:

./velveth contig 31,45,2 -fastq -shortPaired seq/sequences-data1.fastq seq/sequences-data2.fastq

Here the hash length is specified as 31,45,2, which runs velveth with hash lengths from 31 to 43 in steps of 2 (note: k-mers have to be odd). This creates seven directories named contig_31 .. contig_43. To save disk space, the Sequences file is symbolically linked by Velvet to the copy in the first directory (in this case contig_31).
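To see which values of k and which output directories a range such as 31,45,2 produces, the expansion can be sketched in a few lines of Python. The helper names are illustrative only; the sketch assumes, as described above, that the upper bound is exclusive and that directories are named prefix_k.

```python
def hash_lengths(spec):
    """Expand a velveth hash-length argument into a list of k values.

    "31"      -> [31]
    "31,45,2" -> 31, 33, ..., 43 (upper bound excluded, as in the text)
    """
    parts = [int(p) for p in spec.split(",")]
    if len(parts) == 1:
        return parts
    if len(parts) != 3:
        raise ValueError("expected 'k' or 'min,max,step'")
    start, stop, step = parts
    return list(range(start, stop, step))


def output_dirs(prefix, spec):
    """Directory names for a range run, following the prefix_k pattern in the text.

    A single hash length is assumed to use the prefix unchanged.
    """
    ks = hash_lengths(spec)
    if "," not in spec:
        return [prefix]
    return [f"{prefix}_{k}" for k in ks]


if __name__ == "__main__":
    print(hash_lengths("31,45,2"))
    print(output_dirs("contig", "31,45,2"))
```

Choosing an odd start and an even step keeps every k odd, which is why 31,45,2 yields exactly the seven directories mentioned above.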

Step6: Running velvetg and determining optimal K

./velvetg contig_33 -exp_cov 396.0 -ins_length1 300 -ins_length2 3000

The expected coverage parameter was estimated by first counting the number of reads in each library with grep piped to wc (word count):

grep "@HWI-EAS210R_0001" 3kb_mp_shuffled.fastq | wc
8362680 8362680 342363112

grep "@HWI-EAS210R_0001" 300bp_pe_shuffled.fastq | wc
6069248 6069248 248522420

The first number in this output is the number of lines that match the grep pattern, i.e. the number of reads. We arrive at the expected coverage by multiplying those counts by the read length in each library and dividing by the total length of the genome (or our best estimate of it):

((8362680 * 38) + (6069248 * 54)) / 1,630,000 = 396

Note that increasing the value of the -exp_cov parameter may improve the N50 of the assembly, but it may also produce mis-assemblies.

When velvetg finishes it outputs the number of nodes, the N50, and the max and total size of the assembly created. The contig_* directory will also contain the following files:

contigs.fa
Graph
LastGraph
Log
PreGraph
Roadmaps
Sequences
stats.txt
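The expected-coverage calculation described above (total sequenced bases divided by genome length) can be sketched as follows. This is an illustrative Python helper, not part of Velvet; the read counts, read lengths and genome size are the ones quoted in the calculation above.

```python
def expected_coverage(libraries, genome_length):
    """Total sequenced bases divided by genome length.

    libraries     -- iterable of (number_of_reads, read_length) pairs
    genome_length -- genome size in bases (or the best estimate of it)
    """
    total_bases = sum(count * length for count, length in libraries)
    return total_bases / genome_length


if __name__ == "__main__":
    # Read counts from the two grep | wc runs, with read lengths 38 and 54.
    libraries = [(8362680, 38), (6069248, 54)]
    print(round(expected_coverage(libraries, 1_630_000)))
```

The rounded result is the value passed to velvetg as -exp_cov.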

These files are explained in detail in the Velvet documentation, but the most useful files for post-analysis are contigs.fa, Log, and stats.txt. These results should be entered into the spreadsheet at the front of the lab. Running the following custom script will output the N50 as well as the N90 value for this assembly. For Ubuntu Linux users:

perl /usr/local/bin/calculateN50.pl auto_*/contigs.fa

where * is the value of k. This N50 value may be slightly different from the one reported by Velvet, because Velvet reports its N50 (as well as everything else) in k-mer space. For example, the relationship between coverage and k-mer coverage is defined by:

Ck = C * (L - k + 1) / L

where C = coverage, L = read length and k = k-mer length. For other quantities such as contig length, the conversion is as simple as adding k - 1 to the reported length.
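The two conversions between k-mer space and nucleotide space can be written out directly. The Python sketch below uses illustrative function names and implements only the relations stated above.

```python
def kmer_coverage(coverage, read_length, k):
    """Ck = C * (L - k + 1) / L"""
    return coverage * (read_length - k + 1) / read_length


def nucleotide_coverage(kmer_cov, read_length, k):
    """Inverse of kmer_coverage: C = Ck * L / (L - k + 1)"""
    return kmer_cov * read_length / (read_length - k + 1)


def length_in_bases(length_in_kmers, k):
    """Convert a contig length reported in k-mers to bases by adding k - 1."""
    return length_in_kmers + k - 1


if __name__ == "__main__":
    # Hypothetical numbers: 100x nucleotide coverage, 36-base reads, k = 31.
    ck = kmer_coverage(100, 36, 31)
    print(f"k-mer coverage: {ck:.1f}")
    print(f"back to nucleotide coverage: {nucleotide_coverage(ck, 36, 31):.1f}")
    print(f"a node of 100 k-mers spans {length_in_bases(100, 31)} bases")
```

The example illustrates why long k-mers relative to the read length are costly: with 36-base reads and k = 31, only 6 of every 36 positions start a full k-mer, so the k-mer coverage is one sixth of the nucleotide coverage.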

Result:
Nodes: 2232
Max length: 94 408 bp
Min length: 89 bp

Nodes with short length (< 400 bp) can be deleted with software such as Geneious or CLC Genomics Workbench.
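As an alternative to a graphical tool, short contigs can be dropped and N50/N90 recomputed with a short script. The Python sketch below is not the calculateN50.pl script mentioned earlier; it assumes a plain FASTA file such as contigs.fa and uses the usual definition of Nx, namely the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the assembly.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:  # ignore any text before the first header
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_length=400):
    """Keep only records whose sequence has at least min_length bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


def n_statistic(lengths, fraction=0.5):
    """N50 for fraction=0.5, N90 for fraction=0.9; 0 for an empty list."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return 0


if __name__ == "__main__":
    # Made-up contig lengths for illustration.
    lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("N50:", n_statistic(lengths, 0.5))
    print("N90:", n_statistic(lengths, 0.9))
```

For a real assembly, the records would come from `read_fasta(open("contigs.fa"))`, and the lengths passed to `n_statistic` would be `len(seq)` for each record kept by `filter_by_length`.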

Part 02

Assembly - Blast - Mapping - Annotation

Step7: Ligate all the contig nodes obtained from Velvet and create the circular genome with Geneious Pro 4.8.5 (Build 2010-03-04 10:01)

Geneious Pro is a commercial bioinformatics software platform. It lets us search, organize and analyze genomic and protein information in a single desktop program, and produces publication-ready images.

- Create a folder and import contigs.fa into this folder.
- Sort all nodes in order and select them all.
- Ligate the nodes with Cloning tools -> Ligate Sequences.
- Select Circularize sequences to make a circular genome.
- Export the circular sequence into a new folder and save it as a FASTA file.


Step8: Predict the full set of open reading frames (ORFs) with GeneMarkS (http://exon.gatech.edu/GeneMark/genemarks.cgi)
The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. GeneMarkS predicted precisely 83.2% of the translation starts of GenBank annotated Bacillus subtilis genes and 94.4% of translation starts in an experimentally validated set of Escherichia coli genes. GeneMarkS can detect prokaryotic genes, in terms of identifying open reading frames containing real genes, with an accuracy matching the level of the best currently used gene detection methods. Accurate translation start prediction, in addition to the refinement of protein sequence N-terminal data, provides the benefit of precise positioning of the sequence region situated upstream to a gene start.

[Figure: Step-by-step diagram of the GeneMarkS procedure]

Figure 2. (A) In the process of GeneMarkS training there is no division of the coding sequence into two clusters. (B) The state "gene" represents a sequence composed of an RBS plus a spacer plus the protein-coding sequence (CDS). Gene overlaps encompass all possible types of superpositions: overlap of genes on the same strand (as observed in operons), overlap of genes on opposite strands, overlap of coding region with RBS, and so on.


Fill in the GeneMarkS web form as follows:

Sequence
- File upload: upload the circular genome
Running Options
- Use Prokaryotic Version
Output Options
- Email address: to receive the result via email
- Translate GeneMarkS predicted genes into proteins (get a list of protein translations of predicted genes in FASTA format; ideal for a smooth transition to using protein data)
Run
- Start GeneMarkS

Result:

1. Protein Translation: copy all of the ORFs and save them into a FASTA file

>Translation: 385..582 (direct), 66 amino acids
MLDLVELLTHWHAGRSQVRLSESLGIDRKTVRKYTAPAIAAGIEPGGEPLSAEQWAELIG
GWFPE*
...

2. Gene List

GeneMark.hmm PROKARYOTIC (Version 2.8)
Date: Wed Apr 20 09:25:23 2011
Sequence file name: sequence
Model file name: GeneMarkS_plus_Heuristic_AT_and_NONC.mod
RBS: Y
Model information: Pseudonative.model
FASTA definition line: empty-FASTA-def-line
Predicted genes

Save the content into a new FASTA file.


Step9: Convert the full ORF FASTA file (obtained from GeneMarkS) to tabular format with the Galaxy tool and edit it with MS Excel

Galaxy tool: http://main.g2.bx.psu.edu/

Once converted to tabular format, the file can be opened and manipulated easily in MS Excel.
- Upload full_orf_mt_hue_20_sorted.faa and convert it to tabular format.
- Save the tabular file and open it with MS Excel.
- Insert a new column (column A, C1) and label its rows orf_0001 to orf_####.
- Save this tabular file and convert it back to FASTA format.
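The same round trip (FASTA to table, add an ORF identifier column, back to FASTA) can be sketched without Galaxy or Excel. The Python below is an illustration under stated assumptions: the identifier format orf_0001 follows the text, while putting the identifier in front of the original header separated by a space is a choice made here, not something the report prescribes.

```python
def fasta_to_rows(lines):
    """Turn FASTA lines into a list of (header, sequence) rows."""
    rows, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                rows.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        rows.append((name, "".join(chunks)))
    return rows


def add_orf_ids(rows):
    """Prepend a running identifier column: orf_0001, orf_0002, ..."""
    return [(f"orf_{i:04d}", header, seq) for i, (header, seq) in enumerate(rows, start=1)]


def to_tabular(rows):
    """One tab-separated line per ORF, suitable for a spreadsheet."""
    return "\n".join("\t".join(row) for row in rows) + "\n"


def to_fasta(rows, width=60):
    """Write rows back as FASTA, wrapping sequences at `width` characters."""
    out = []
    for orf_id, header, seq in rows:
        out.append(f">{orf_id} {header}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = [
        ">Translation: 385..582 (direct), 66 amino acids",
        "MLDLVELLTHWHAGRSQVRLSESLGIDRKTVRKYTAPAIAAGIEPGGEPLSAEQWAELIG",
        "GWFPE*",
    ]
    rows = add_orf_ids(fasta_to_rows(example))
    print(to_tabular(rows), end="")
    print(to_fasta(rows), end="")
```

The numbering follows the order of the records in the input file, so the file should already be sorted by genome position, as the name full_orf_mt_hue_20_sorted.faa suggests.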

Step10: Blast the ORFs against the NCBI server via Blast2GO

Blast2GO is an all-in-one tool for functional annotation of (novel) sequences and the analysis of annotation data. Blast2GO can annotate thousands of sequences in one session. We can follow and modify the annotation process at any stage.


Start Blast2GO by Java Web Start.

Requirements:
- A working Java installation (version > 1.5; the latest version is 1.6)
- At least 512 MB of free system memory (recommended: 2000-3000 MB)
- A high-speed internet connection

A. Blast all ORFs against the NCBI server
- Create a new project, then import full_orf_mt_hue_20_sorted.faa (with the ORF numbering added).
- Run the BLAST step with the configuration below.

The blast process can be paused temporarily: save the data and continue the blast later. With the 4757 ORFs of the MT_HUE_20 sample and Blast Hits = 20, the run takes about 24 hours over a high-speed internet connection; in this case we used Blast Hits = 5. When the blast process has finished, export the blast result as a FASTA file: File > Export > Exports as FASTA


Step11: Create a GFF file to annotate the circular genome MT_HUE_20

GFF (General Feature Format) lines are based on the GFF standard file format. GFF lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly. Here is a brief description of the GFF fields:

1. seqname - The name of the sequence. Must be a chromosome or scaffold.
2. source - The program that generated this feature.
3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
4. start - The starting position of the feature in the sequence. The first base is numbered 1.
5. end - The ending position of the feature (inclusive).
6. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
9. group - All lines with the same group are linked together into a single item.

Example (fields are tab-separated):

MT_HUE_20_circular	GeneMarkS	source	1	4559459	.	+	.	Name source
MT_HUE_20_circular	GeneMarkS	CDS	385	582	.	+	.	Name orf_0001 ; locus_tag integrase catalytic region
MT_HUE_20_circular	GeneMarkS	CDS	665	2749	.	+	.	Name orf_0002 ; locus_tag transposase
MT_HUE_20_circular	GeneMarkS	CDS	2991	5033	.	+	.	Name orf_0003 ; locus_tag conserved hypothetical protein
MT_HUE_20_circular	GeneMarkS	CDS	5046	5168	.	+	.	Name orf_0004 ; locus_tag ---NA---
MT_HUE_20_circular	GeneMarkS	CDS	5333	6586	.	+	.	Name orf_0005 ; locus_tag cytochrome p450 125 cyp125
MT_HUE_20_circular	GeneMarkS	CDS	6586	7605	.	+	.	Name orf_0006 ; locus_tag acyl- dehydrogenase fade28
MT_HUE_20_circular	GeneMarkS	CDS	7683	8753	.	+	.	Name orf_0007 ; locus_tag acyl- dehydrogenase fade29
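Because a GFF file with spaces instead of tabs fails silently in many viewers, it can help to check each line against the nine-field layout described above before importing it. The following Python sketch is a minimal validator written for this report's examples; the class and function names are illustrative.

```python
from typing import NamedTuple


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    group: str


def parse_gff_line(line):
    """Split one GFF line into its nine tab-separated fields and check them."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, found {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, group = fields
    start, end = int(start), int(end)
    if start < 1 or end < start:
        raise ValueError("start must be >= 1 and end must be >= start")
    if strand not in {"+", "-", "."}:
        raise ValueError(f"invalid strand: {strand!r}")
    if frame not in {"0", "1", "2", "."}:
        raise ValueError(f"invalid frame: {frame!r}")
    return GffRecord(seqname, source, feature, start, end, score, strand, frame, group)


if __name__ == "__main__":
    line = ("MT_HUE_20_circular\tGeneMarkS\tCDS\t385\t582\t.\t+\t.\t"
            "Name orf_0001 ; locus_tag integrase catalytic region")
    record = parse_gff_line(line)
    print(record.feature, record.start, record.end, record.end - record.start + 1)
```

For orf_0001 the feature spans positions 385 to 582, i.e. 198 bases, which matches the 66 codons (including the stop codon) reported by GeneMarkS for this ORF.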

- Convert the ORF blast result to tabular format with the Galaxy tool.
- Open the tabular file with MS Excel and split the content of the first column into two columns (Data > Text to Columns > Delimited with | > Finish):
  orf_0001|integrase catalytic region => orf_0001 integrase catalytic region
- Delete the amino acid sequence column.
- Create the GFF file from the tabular file and the gene list with MS Excel, where:
  C1: name of the MT circular genome (MT_HUE_20_circular)
  C9: =CONCATENATE("Name ",#column orf_number," ; ","locus_tag ", #column Sequence desc.)
- Copy the content of the Excel file and paste it into a .txt file. Rename this file to mt_hue_20_circular.gff.
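The spreadsheet steps above (split the blast header at the | character, build the group column with CONCATENATE, write nine tab-separated columns) can also be scripted. The Python below is a sketch of that procedure with illustrative function names; the source name GeneMarkS and the sequence name MT_HUE_20_circular are taken from the example, and coordinates are assumed to come from the GeneMarkS gene list.

```python
def split_blast_header(header):
    """'orf_0001|integrase catalytic region' -> ('orf_0001', 'integrase catalytic region')"""
    orf_id, _, description = header.lstrip(">").partition("|")
    return orf_id.strip(), description.strip()


def group_field(orf_id, description):
    """Equivalent of the CONCATENATE formula used for column 9."""
    return f"Name {orf_id} ; locus_tag {description}"


def gff_line(seqname, feature, start, end, strand, group,
             source="GeneMarkS", score=".", frame="."):
    """Join the nine GFF fields with tabs, as the format requires."""
    fields = [seqname, source, feature, str(start), str(end), score, strand, frame, group]
    return "\t".join(fields)


if __name__ == "__main__":
    orf_id, description = split_blast_header(">orf_0001|integrase catalytic region")
    print(gff_line("MT_HUE_20_circular", "CDS", 385, 582, "+",
                   group_field(orf_id, description)))
```

Writing the lines from a script avoids the most common failure of the copy-and-paste route, which is tabs being converted to spaces by a text editor.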

Step12: Open the GFF file with Geneious

To obtain the full annotated genome of the MT_HUE_20 strain, we combine in Geneious the circular sequence built from the contigs, the sequence descriptions obtained by blasting all ORFs, and the GFF file.
- Open Geneious and create a new folder named GFF.
- Import the mt_hue_20_circular.gff file.
- Supply the sequence for this GFF file (mt_hue_20_circular.fasta).
- Visualize the genome in circular form: Tool -> Circular Sequences.
- Zoom in or out to find a specific ORF.


Figure: A long fragment of the genome of the MT_HUE_20 strain, containing many ORFs

Step13: Work with specific genes in the annotated genome of MT_HUE_20

To find a specific gene, for example the RNA polymerase beta subunit (rpoB) gene, we look it up in the topBlast data to identify the name of the ORF. In this case:

>orf_1934|dna-directed rna polymerase subunit beta rpob

We then use Geneious to analyze this sequence: export the sequence, blast it against the NCBI server, find mutations, and so on.

Part 03

Bowtie 0.12.7, MagicViewer

1. Bowtie is an ultrafast, memory-efficient short read aligner geared toward quickly aligning large sets of short DNA sequences (reads) to large genomes. It aligns 35-base-pair reads to the human genome at a rate of 25 million reads per hour on a typical workstation. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: for the human genome, the index is typically about 2.2 GB (for unpaired alignment) or 2.9 GB (for paired-end or colorspace alignment). Multiple processors can be used simultaneously to achieve greater alignment speed. Bowtie can also output alignments in the standard SAM format, allowing Bowtie to interoperate with other tools supporting SAM, including the SAMtools consensus, SNP, and indel callers. Bowtie runs on the command line under Windows, Mac OS X, Linux, and Solaris.

Bowtie also forms the basis for other tools, including:
- TopHat: a fast splice junction mapper for RNA-seq reads
- Cufflinks: a tool for transcriptome assembly and isoform quantitation from RNA-seq reads
- Crossbow: a cloud-computing software tool for large-scale resequencing data
- Myrna: a cloud-computing tool for calculating differential gene expression in large RNA-seq datasets

Windows Shell: align the full set of sequence reads (fastq) and write the result to a .SAM file

D:\Softwares\Biotool\bowtie-0.12.7>bowtie.exe -S ./indexes/Test1/fullseq.fastq align_mt.sam

Syntax: bowtie_folder>bowtie.exe -S ./[path_file_fullseq.fastq] ./[path_file_fullseq.sam]

In Bowtie's general usage, the basename of an index built with bowtie-build precedes the reads file.
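A quick sanity check on the resulting SAM file is to count how many reads were aligned. The Python sketch below relies only on the SAM format itself: header lines start with @, the second column is the FLAG field, and bit 0x4 of the FLAG marks an unmapped read. The function name is illustrative, and the sample records in the demonstration are made up.

```python
def count_alignments(lines):
    """Return (mapped, unmapped) record counts for an iterable of SAM lines."""
    mapped = unmapped = 0
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:  # bit 0x4 set: the read is unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


if __name__ == "__main__":
    # Hypothetical SAM content: one header line, one mapped and one unmapped read.
    sam = [
        "@HD\tVN:1.0\tSO:unsorted",
        "read1\t0\tref\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
    ]
    mapped, unmapped = count_alignments(sam)
    print(f"mapped: {mapped}, unmapped: {unmapped}")
```

For a real run the list would be replaced by an open file handle, for example `open("align_mt.sam")`, which is iterated line by line without loading the whole file into memory.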

2. MagicViewer helps us study short-read data from a variety of genome projects, such as de novo sequencing, transcriptome sequencing and targeted re-sequencing, especially exon-capture and high-throughput sequencing. It can be used for mapping, SNP detection or association studies.

Analyze the .SAM file with the MagicViewer_1.2.1_i386_win32 program

Step 1: Run the MagicViewer.bat file from the Windows Shell

D:\Softwares\Biotools\MagicViewer_1.2.1_i386_win32>MagicViewer.bat

Step 2: Convert the .SAM file to a sorted, indexed .BAM file

Create a new project, then input the reference genome FASTA file (H37Rv genome from NCBI) and the alignment file (the full-sequence SAM file). MagicViewer will convert it to a sorted, indexed BAM file.
