Basic Workflow
SEQUENCER -> FASTQ -> alignment to REFERENCE -> SAM -> BAM
File Formats
FASTA
>seq_1 description
ATGCTGCTGACGTAGCGATGCAGTAGCAGGTACGAGTCGCAGT
GCAGATGCA
>seq_2
GTAGACGATCGATGCAGCATGACGATGACGATGACGACGATGA
CGATAGCAGATGCA
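The FASTA layout above (a ">" header line followed by one or more sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a full parser: the function name parse_fasta is a hypothetical helper, and it assumes the identifier is the first word of the header line.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence id: sequence}, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: first word is the id, the rest is a description.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record opened by the last header.
            records[current_id] += line
    return records


example = """>seq_1 description
ATGCTGCTGACGTAGCGATGCAGTAGCAGGTACGAGTCGCAGT
GCAGATGCA
>seq_2
GTAGACGATCGATGCAGCATGACGATGACGATGACGACGATGA
CGATAGCAGATGCA
"""

fasta_records = parse_fasta(example.splitlines())
for name, seq in fasta_records.items():
    print(name, len(seq), seq[:10] + "...")
```

Wrapped sequence lines are concatenated, so each record ends up as one continuous string regardless of how the file was line-wrapped.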
FASTQ
text-based format
four lines per sequence entry
stores each sequence together with its corresponding quality scores
the most commonly used format for storing sequencing reads
usually indicated by the suffix *.fastq or *.fq
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
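The four-line structure of a FASTQ entry can be checked mechanically. The sketch below assumes strictly four lines per record (no wrapped sequence or quality lines); FastqRecord and parse_fastq are hypothetical names chosen for illustration.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(rows), 4):
        header, seq, sep, qual = rows[i:i + 4]
        # Line 1 starts with "@", line 3 starts with "+".
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # Every base needs exactly one quality character.
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


example_lines = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

fastq_records = list(parse_fastq(example_lines))
for rec in fastq_records:
    print(rec.read_id, len(rec.sequence), len(rec.quality))
```

The length check is the useful part in practice: a sequence line and its quality line must always have the same number of characters.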
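@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
EAS139    the unique instrument name
136       the run id
FC706VJ   the flowcell id
2         flowcell lane
2104      tile number within the flowcell lane
15343     x-coordinate of the cluster within the tile
197393    y-coordinate of the cluster within the tile
1         the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y         Y if the read is filtered (did not pass), N otherwise
18        0 when none of the control bits are on, otherwise an even number
ATCACG    index sequence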
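The identifier line shown above can be split into its fields programmatically. This is a sketch under assumptions: parse_illumina_header is a hypothetical helper, the dictionary keys are labels chosen here for illustration, and the header is assumed to have exactly the seven-plus-four field layout of the example.

```python
from typing import Dict

# Field labels are illustrative names for the colon-separated parts.
LEFT_KEYS = ["instrument", "run_id", "flowcell_id", "lane", "tile", "x", "y"]
RIGHT_KEYS = ["pair_member", "filtered", "control_bits", "index_sequence"]


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Split an identifier of the form shown above into named fields."""
    # The two halves are separated by a single space.
    left, right = header.lstrip("@").split(" ", 1)
    left_vals = left.split(":")
    right_vals = right.split(":")
    if len(left_vals) != len(LEFT_KEYS) or len(right_vals) != len(RIGHT_KEYS):
        raise ValueError("header does not match the expected field layout")
    return dict(zip(LEFT_KEYS + RIGHT_KEYS, left_vals + right_vals))


header_fields = parse_illumina_header(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
)
for key, value in header_fields.items():
    print(f"{key:15s} {value}")
```

All values are kept as strings; converting lane, tile and coordinates to integers is left to the caller.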
Quality
Q = -10 log10(P), where P is the base-calling error probability
(i.e., the probability that the corresponding base call is
incorrect)
!#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
http://en.wikipedia.org
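The relation between Q and P, and the mapping from quality characters to scores, can be worked through in a few lines. This sketch assumes the Phred+33 encoding (the character "!" represents Q = 0); other offsets exist, so the offset is a parameter. The function names are illustrative.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 log10(P) to get P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert each quality character to its score (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in quality]


for ch in "!5I":
    q = decode_quality(ch)[0]
    print(repr(ch), "Q =", q, "P =", phred_to_error_prob(q))
```

Under this encoding, Q = 20 corresponds to a 1 in 100 chance that the base call is wrong, and Q = 30 to 1 in 1000.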
SAM
Li et al., 2009.
BAM
BAM is the compressed binary version of the SAM format:
a compact and indexable representation of nucleotide sequence
alignments.
VCF
Variant Call Format
VCF is a text
file format (usually stored compressed). It
contains meta-information lines, a header line, and then
data lines, each containing information about a position in
the genome. The format can also hold
genotype information on samples for each position.
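The three kinds of VCF lines described above (meta-information lines starting with "##", one header line starting with "#", then tab-separated data lines) can be separated as follows. This is a minimal sketch: parse_vcf is a hypothetical helper, and the example records, sample names and values are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def parse_vcf(lines: Iterable[str]) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Split VCF text into meta-information, header columns and data records."""
    meta: List[str] = []
    header: List[str] = []
    records: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])               # meta-information line
        elif line.startswith("#"):
            header = line[1:].split("\t")       # single header line
        else:
            if not header:
                raise ValueError("data line found before the header line")
            # One data line describes one position in the genome.
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


# Hypothetical two-sample example, built with explicit tabs.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "sample_1", "sample_2"]),
    "\t".join(["chr1", "101", ".", "A", "G", "50", "PASS",
               "DP=20", "GT", "0/1", "1/1"]),
]

vcf_meta, vcf_header, vcf_records = parse_vcf(vcf_lines)
for rec in vcf_records:
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"],
          rec["sample_1"], rec["sample_2"])
```

Columns after FORMAT hold the per-sample genotype information; here each sample column carries only the GT field.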
Hapmap
text-based file format
stores information on a series of SNPs as well as on the germplasm
http://www.maizegenetics.net
source  type  start  end    score     strand  phase  attributes
GLEAN   mRNA  76315  78595  0.990688  +       .      ID=Ca_11934;
GLEAN   CDS   76315  76450  .         +       0      Parent=Ca_11934;
GLEAN   CDS   76668  76852  .         +       2      Parent=Ca_11934;
GLEAN   CDS   77457  77657  .         +       0      Parent=Ca_11934;
GLEAN   CDS   77994  78155  .         +       0      Parent=Ca_11934;
GLEAN   CDS   78233  78595  .         +       0      Parent=Ca_11934;
GLEAN   mRNA  85322  90545  0.655887  +       .      ID=Ca_11933;
GLEAN   CDS   85322  86173  .         +       0      Parent=Ca_11933;
GLEAN   CDS   88630  89316  .         +       0      Parent=Ca_11933;
GLEAN   CDS   89970  90545  .         +       0      Parent=Ca_11933;
GLEAN   mRNA  94102  99473  0.967529  -       .      ID=Ca_11932;
GLEAN   CDS   98946  99473  .         -       0      Parent=Ca_11932;
GLEAN   CDS   97180  97620  .         -       0      Parent=Ca_11932;
GLEAN   CDS   96589  96819  .         -       0      Parent=Ca_11932;
GLEAN   CDS   95733  95797  .         -       0      Parent=Ca_11932;
GLEAN   CDS   95601  95658  .         -       1      Parent=Ca_11932;
GLEAN   CDS   94282  94350  .         -       0      Parent=Ca_11932;
GLEAN   CDS   94102  94200  .         -       0      Parent=Ca_11932;
ID=geneA;Name=geneA
ID=exonA1;Parent=geneA
gene_id "geneA";transcript_id "geneA.1";
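The three attribute strings above follow two different conventions: "key=value" pairs separated by semicolons (the GFF3 style) and 'key "value";' pairs (the GTF style). A minimal sketch of parsing both is given below. The function names are hypothetical, and the sketch ignores escaped characters and semicolons inside quoted values.

```python
from typing import Dict


def parse_gff3_attributes(text: str) -> Dict[str, str]:
    """Parse 'ID=exonA1;Parent=geneA' style attributes."""
    attrs: Dict[str, str] = {}
    for part in text.strip().split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, _, value = part.partition("=")
        attrs[key] = value
    return attrs


def parse_gtf_attributes(text: str) -> Dict[str, str]:
    """Parse 'gene_id "geneA";transcript_id "geneA.1";' style attributes."""
    attrs: Dict[str, str] = {}
    for part in text.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        # Key and value are separated by whitespace; the value is quoted.
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


gff3_gene = parse_gff3_attributes("ID=geneA;Name=geneA")
gff3_exon = parse_gff3_attributes("ID=exonA1;Parent=geneA")
gtf_row = parse_gtf_attributes('gene_id "geneA";transcript_id "geneA.1";')
print(gff3_gene)
print(gff3_exon)
print(gtf_row)
```

In the first style the parent-child hierarchy is expressed through ID and Parent; in the second, every row repeats the gene and transcript identifiers it belongs to.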
Tools
Quality Control
Why Quality Control?
QC Tools
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
provides a quick overview showing in which areas
there may be problems
summary graphs and tables for quickly assessing your
data
export of results to an HTML-based permanent report
offline operation, allowing automated generation of
reports without running the interactive application
[Figure: example QC plots labelled BAD (source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), BAD (source: http://prinseq.sourceforge.net) and GOOD (source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)]
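One kind of summary a QC step can produce is the mean quality score at each read position. The sketch below illustrates the idea only and is not how FastQC or PRINSEQ are implemented. It assumes Phred+33 quality strings; the threshold of 20 and the function names are arbitrary choices for illustration.

```python
from typing import List


def mean_quality_per_position(qualities: List[str], offset: int = 33) -> List[float]:
    """Average quality score at each read position across all reads."""
    longest = max((len(q) for q in qualities), default=0)
    totals = [0] * longest
    counts = [0] * longest
    for q in qualities:
        for i, ch in enumerate(q):
            totals[i] += ord(ch) - offset
            counts[i] += 1  # reads may differ in length
    return [t / c for t, c in zip(totals, counts)]


def flag_low_positions(means: List[float], threshold: float = 20.0) -> List[int]:
    """Return 0-based positions whose mean quality is below the threshold."""
    return [i for i, m in enumerate(means) if m < threshold]


# Hypothetical quality strings: 'I' = Q40, '5' = Q20, '!' = Q0.
reads_quality = ["IIII5", "III5!", "II5"]
position_means = mean_quality_per_position(reads_quality)
low_positions = flag_low_positions(position_means)
print([round(m, 1) for m in position_means])
print("low-quality positions:", low_positions)
```

Quality that drops toward the 3' end of the reads, as in this toy input, is a typical reason to trim reads before alignment.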
Alignment Tools
Also called mapping
used in experiments where the genome is known
aligns reads to the reference genome
computationally intensive for large data volumes and large reference
genomes
Bowtie2
BWA
Roche
http://www.clcbio.com
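To make the idea of aligning reads to a reference concrete, the sketch below maps a read by looking up its first k bases in a k-mer index of the reference and then counting mismatches over the full read length. This is a toy illustration only: it is not how Bowtie2 or BWA work internally, it searches the forward strand only, and it does not handle gaps. All names, the reference string and the parameters are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return dict(index)


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_mismatches: int = 1) -> List[int]:
    """Return 0-based reference positions where the read aligns."""
    hits = []
    # Seed: exact match of the first k bases of the read.
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        # Extend: count mismatches over the whole read.
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


reference = "ACGTACGTTAGCCGATAGGCTA"
k = 4
kmer_index = build_kmer_index(reference, k)
for read in ["TAGCCGAT", "TAGCCGTT", "GGGGGGGG"]:
    print(read, map_read(read, reference, kmer_index, k))
```

Even this toy version shows where the cost comes from: the index grows with the reference, and every read requires a lookup plus a base-by-base comparison, which is why dedicated aligners rely on compressed indexes.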
References
Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics (2014), btu170.
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods (2012), 9:357-359.
Li H et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics (2009), 25:2078-2079.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (2009), 25:1754-1760.
Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal (2011), 17:10-12.
Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. Integrative Genomics Viewer. Nature Biotechnology (2011), 29:24-26.
Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics (2011), 27:863-864.
Thank you!