
BIOINF 525: INTRODUCTION TO BIOINFORMATICS LAB SESSION 4

Genome Informatics
Dr. Ryan E. Mills & Hongyang Li
Feb 2016

Overview:
The purpose of this lab session is to cover a set of tools used in high-throughput sequencing and the process of investigating interesting gene variants in genomics.

Introduction
High-throughput sequencing is now routinely applied to gain insight into a wide range of important topics in biology and medicine [see: Soon et al. EMBO 2013 on CTools].

In this lab we will use the Galaxy web-based interface to a suite of bioinformatics tools for genomic sequence analysis. Galaxy is free and comparatively easy to use (see Figure 1 for a schematic comparison of some common bioinformatics RNA-Seq analysis methods).

Galaxy was originally written for genomic data analysis. However, the set of available tools has been greatly expanded over the years, and Galaxy is now also used for gene expression, genome assembly, proteomics, epigenomics, transcriptomics and a host of other subdisciplines in bioinformatics.

Registering for a Galaxy account
First create an account on the main public Galaxy portal at https://main.g2.bx.psu.edu/

Under the User tab at the top of the page, select the Register link and follow the instructions on that page.


This will only take a moment, and will allow all the work that you do to persist between sessions and allow you to name, save, share, and publish Galaxy histories, workflows, datasets and pages.

Section 1: Find the interesting genome variants
There are a number of gene variants associated with childhood asthma. A study from Verlaan et al. (2009) shows that 4 candidate SNPs demonstrate significant evidence for association. You want to find what they are in OMIM (http://www.omim.org).

Q1: What are those 4 candidate SNPs?
[HINT, you may want to check the first few links of the search results]
rs12936231, rs8067378, rs9303277, and rs7216389

Q2: What are the three genes affected?
ZPBP2, GSDMB, and ORMDL3

Now, you want to know the location of the SNPs and genes on the genome. You can find the information on the UCSC genome browser (http://genome.ucsc.edu) or the Ensembl genome browser (http://www.ensembl.org).

Q3: What is the location of rs8067378? What are the different alleles for rs8067378?
[HINT, you may search in the genome browser]

Q4: What are the downstream genes for rs8067378? Are any of them named ZPBP2, GSDMB, or ORMDL3?

You are interested in the genotypes of these SNPs in a particular sample (HG00109). Go to the 1000 Genomes browser (http://browser.1000genomes.org/) and look up their genotypes.

Q5: What are the individual genotypes for the particular sample (HG00109)?
[HINT: use the 1000 Genomes browser to look up genotypes]

Section 2: RNA-Seq analysis
Now, you want to understand whether the SNP will affect the expression of the gene.

You find the RNA-Seq data of one sample on CTools (HG00109_1.fastq, HG00109_2.fastq). However, these are raw sequence FASTQ files; more detail about the FASTQ format is available at http://en.wikipedia.org/wiki/FASTQ_format. To get a quick analysis of the data, download the files and upload them to Galaxy.
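To make the FASTQ layout concrete, here is a minimal Python sketch that reads four-line FASTQ records and converts the quality string into Phred scores. It assumes unwrapped records (exactly four lines each) and the Sanger encoding (ASCII offset 33), which is what the fastqsanger type denotes. The function names and the example read are hypothetical and are not taken from the HG00109 files.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # text after the leading '@' on line 1
    sequence: str  # nucleotide calls on line 2
    quality: str   # one quality character per nucleotide on line 4


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped FASTQ file (four lines per read)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the sketch
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string; offset 33 is the Sanger convention."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    # Made-up read, used only to illustrate the format.
    example = io.StringIO("@read1/1\nGATTACA\n+\nIIII#II\n")
    for record in read_fastq(example):
        print(record.name, record.sequence, phred_scores(record.quality))
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 20 means roughly a 1 in 100 chance that the base call is wrong.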

Be careful with the file type. Tophat2 only takes the fastqsanger file format, so you need to choose fastqsanger for the Type.

Now you can inspect the data in the right panel, so that you have a better understanding of what each column/row represents.

Q6: What is the size and format of the data?

Q7: What do the first, second and fourth rows represent?
[HINT, you can check the fastq format wiki for more information]

Q8: Does the first sequence have good quality?
[HINT, what is the quality score for each nucleotide?]

Quality Control
You should understand the reads a bit before analyzing them. Run a quality control check on your data using the [NGS: QC and manipulation >] FastQC tool. Often, it is useful to trim reads to remove base positions that have a low median (or bottom quartile) score.
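The idea of trimming on a per-position median score can be sketched in Python as follows. This is a simplified illustration, not what FastQC or Galaxy's trimming tools do internally: it assumes Sanger-encoded quality strings (offset 33) and cuts every read at the first position whose median falls below the threshold. The function names and the three quality strings are hypothetical.

```python
from statistics import median
from typing import List, Sequence


def median_quality_by_position(quality_strings: Sequence[str],
                               offset: int = 33) -> List[float]:
    """Median Phred score at each read position across all reads."""
    longest = max(len(quality) for quality in quality_strings)
    medians = []
    for position in range(longest):
        # Shorter reads simply do not contribute to later positions.
        column = [ord(quality[position]) - offset
                  for quality in quality_strings
                  if len(quality) > position]
        medians.append(median(column))
    return medians


def trim_length(medians: Sequence[float], threshold: float = 20) -> int:
    """Number of leading positions to keep before the first low median."""
    for position, value in enumerate(medians):
        if value < threshold:
            return position
    return len(medians)


if __name__ == "__main__":
    # Made-up quality strings: 'I' is Q40, '5' is Q20, '#' is Q2.
    qualities = ["IIII5", "III5#", "IIII#"]
    medians = median_quality_by_position(qualities)
    print(medians, "keep first", trim_length(medians), "bases")
```

The default threshold of 20 mirrors the cutoff this lab uses in Q10 and Q11.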

After running the FastQC program, you will get a FastQC Report. Click on the red box, and the report will show in the center of the data browser.

Q9: What are the GC content and the format of the fastq file?
[HINT, you may check Basic Statistics]

Q10: How about per base sequence quality? Does any base have a median quality score below 20?
[HINT, the blue line is the median quality score.]

Q11: For this exercise, assume a median quality score of below 20 to be unusable. Given this criterion, is trimming needed for the datasets?

Map reads to genome

The next step is mapping the processed reads to the genome. The major challenge when mapping RNA-Seq reads is that the reads, because they come from RNA, often cross splice junction boundaries. Splice junctions are not present in a genome's sequence, and hence typical NGS mappers such as Bowtie (http://bowtie-bio.sourceforge.net/index.shtml) and BWA (http://bio-bwa.sourceforge.net/) are not ideal without modifying the genome sequence. Instead, it is better to use a mapper such as Tophat (http://ccb.jhu.edu/software/tophat) that is designed to map RNA-Seq reads.

Use the [NGS: RNA Analysis >] Tophat tool to map RNA-Seq reads to the hg19 build. The data you have is paired-end data. In Galaxy, you need to set the forward read file and the reverse read file. Because the reads are paired, you'll need to set the mean inner distance between pairs; this is the average distance in base pairs between reads, not the total insert/fragment size. Use a mean inner distance of 150 for our data.
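The distinction between inner distance and fragment size comes down to simple arithmetic, sketched below in Python. The fragment length and read length in the example are hypothetical values chosen only so that the result matches the 150 used in this lab; the actual library parameters of the HG00109 data are not stated here.

```python
def mean_inner_distance(mean_fragment_length: int, read_length: int) -> int:
    """Gap between the two mates of a pair, in base pairs.

    The fragment is covered by mate 1, then the unsequenced inner
    region, then mate 2, so the inner distance is the fragment length
    minus both read lengths. A negative result means the mates overlap.
    """
    return mean_fragment_length - 2 * read_length


if __name__ == "__main__":
    # Hypothetical library: 300 bp fragments sequenced as 2 x 75 bp reads.
    print(mean_inner_distance(300, 75))
```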

There will be four outputs: accepted_hits, insertions, deletions and splice junctions. You can visualize the accepted_hits in your favorite genome browser, such as the UCSC Genome Browser.

Q12: What is the first entry of splice junctions? Where is the junction located?
[HINT, check the output of Tophat splice junctions]

Q13: Where are most of the hits located?
[HINT, you can view the accepted hits in the UCSC Genome Browser, and search region: chr17:38007296-38170000]

Q14: Following Q13, is there any interesting gene around that area?
[HINT, you can find genes around the accepted hits in the UCSC Genome Browser]

The mapped reads on the UCSC Genome Browser:

With the alignment result from TopHat, you can calculate the gene expression with Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/). Before running Cufflinks, you should first upload the reference annotation file gene_chr17.gtf (also on CTools) into your Galaxy workspace. The following figure shows what parameters you need to change.

Q15: What is the FPKM for the gene from Q13?
136853
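For orientation, FPKM stands for fragments per kilobase of transcript per million mapped fragments. The Python sketch below applies that textbook definition directly. Cufflinks estimates abundances with a statistical model rather than this plain ratio, so treat the function as an illustration of the unit, not a reimplementation. The input numbers are hypothetical.

```python
def fpkm(fragments_on_gene: int, gene_length_bp: int,
         total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = gene_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments_on_gene / (kilobases * millions_mapped)


if __name__ == "__main__":
    # Hypothetical gene: 500 fragments on a 2,000 bp transcript,
    # out of 10,000,000 mapped fragments in the whole sample.
    print(fpkm(500, 2_000, 10_000_000))
```

Dividing by transcript length and by sequencing depth is what makes values comparable between genes and between samples.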

Section 3: Population Scale Analysis
One sample is not enough to know what is happening in a population. You are interested in assessing genetic differences on a population scale. So, you processed about 230 samples and did the normalization on the genome level. Now, you want to find whether there is any association between one of the 4 asthma-associated SNPs (rs8067378) and ORMDL3 expression.

This is the final file you got (http://tinyurl.com/bioinfo525lab4data). The first column is the sample name, the second column is the genotype and the third column is the expression value.


You wrote some R code to get an overview of the data. The R code is displayed here (http://bit.ly/1wXl4Eo). (We will introduce R in the next lab.)

Q16: What is the sample size for A/A?
[HINT, the lower section of the browser contains the output for your R code. geno is the column for genotype sample size]

Q17: What is the median expression value for A/A and G/G?
[HINT, you can find the value from the upper-right graphs. The graph is a box plot, which you can learn more about here (http://en.wikipedia.org/wiki/Box_plot)]

Q18: What could you infer from the relative expression value between A/A and G/G? Does the SNP affect the expression of ORMDL3?

Q19: What one part of this lab or associated lecture material is still confusing? If appropriate, please also indicate the question number from this lab instruction PDF and answer the question in the following anonymous form:
http://tinyurl.com/bioinfo525lab4

All data files can also be found at:
https://github.com/ajing/Bioinfo525_lab4

You can also search in Published Workflows for Bioinfo525_lab4, which contains the second section of the lab.

Reference:
Verlaan, et al. Allele-specific chromatin remodeling in the ZPBP2/GSDMB/ORMDL3 locus associated with the risk of asthma and autoimmune disease. Am. J. Hum. Genet. 85: 377-393, 2009.

The second section of the lab is adapted from https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise.
