Indel-based Realignment

Improving the original alignments of the reads based on multiple sequence (re-)alignment
You are here in the GATK Best Practices workflow for germline variant discovery

Data Pre-processing >> Variant Discovery >> Callset Refinement

[Workflow diagram, three columns]

Data Pre-processing: Raw Reads -> Map to Reference (BWA mem, non-GATK) -> Mark Duplicates & Sort (Picard) -> Indel Realignment -> Base Recalibration -> Analysis-Ready Reads

Variant Discovery: Analysis-Ready Reads -> Var. Calling (HC in ERC mode) -> Genotype Likelihoods -> Joint Genotyping -> Raw Variants (SNPs, Indels) -> Variant Recalibration (separately per variant type) -> Analysis-Ready Variants (SNPs, Indels)

Callset Refinement: Analysis-Ready Variants (SNPs & Indels) -> Genotype Refinement, Variant Annotation -> Variant Evaluation -> look good? -> troubleshoot / use in project
InDels = insertion/deletion

Insertion
Ref seq     A G C T A G G G T C
Sample seq  A G C T A G G G T C  (extra bases "TC" and "T" inserted, positions as drawn in the figure)

Deletion
Ref seq     A G C T A G G G T C
Sample seq  A G C G G T C
The problem we want to fix

Alignment by BWA

•  Several consecutive "SNPs" only found on reads ending on the right of the homopolymer
•  Several consecutive "SNPs" only found on reads ending on the left of the homopolymer
•  7 bp "T" homopolymer run

After realignment

•  Adding a 1-bp insertion brings sanity to the entire alignment
Why does this happen?

•  Mappers cannot "see" indels near ends of reads
•  Because mismatches are "cheaper" than a gap in this context

ref  CATG
ins  CA CCA TG
del  CA G

Mismatch = -1
Open gap = -3

Ref  T A C C C A T T T T T T T C T A A A A G C T
BWA  C C A T T T T T T C T A A A A A C T
IR   C C A - T T T T T T C T A A A A A C T

✓  Local realignment around indels -> most parsimonious alignment

✓  Improves accuracy of several downstream processing steps
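The trade-off above can be made concrete with a small scoring sketch. It uses the two penalties given on the slide (mismatch = -1, open gap = -3). The gap-extension penalty, the zero score for matches, and the toy sequences are assumptions added for illustration; this is not the scoring code of BWA or any other mapper.

```python
def alignment_score(ref_aln, read_aln, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score two equal-length aligned strings; '-' marks a gap.

    mismatch and gap_open come from the slide; gap_extend and the
    zero score for matches are assumptions made for this sketch.
    """
    score = 0
    in_gap = False
    for r, q in zip(ref_aln, read_aln):
        if r == "-" or q == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
            if r != q:
                score += mismatch
    return score


# Toy reference with a 7 bp T homopolymer; the sample carries a 1 bp T deletion.
# Short read ending two bases after the homopolymer:
short_ungapped = alignment_score("GCATTTTTTTC", "GCATTTTTTCA")
short_gapped = alignment_score("GCATTTTTTTCA", "GCATTTTTT-CA")

# Longer read from the same sample, extending six bases past the homopolymer:
long_ungapped = alignment_score("GCATTTTTTTCAGTA", "GCATTTTTTCAGTAC")
long_gapped = alignment_score("GCATTTTTTTCAGTAC", "GCATTTTTT-CAGTAC")

print(short_ungapped, short_gapped)  # mismatches win near the read end
print(long_ungapped, long_gapped)    # the gap wins when more bases follow
```

Under these penalties the short read scores better with two mismatches than with one gap, while the longer read scores better with the gap. That is why the spurious "SNPs" cluster on reads that end close to the indel.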
How do we identify where realignment is needed?

•  Known sites (e.g. dbSNP, 1000 Genomes)

•  Indels seen in original alignments (in CIGARs)

•  Sites where evidence suggests a hidden indel

- Entropy calculation identifies "messy areas"
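One way to picture an entropy calculation over a pileup is the Shannon entropy of the bases observed in each column. The sketch below is only an illustration of that idea: the function names, the threshold, and the toy columns are hypothetical, and it is not necessarily the calculation RealignerTargetCreator performs.

```python
import math
from collections import Counter


def column_entropy(bases):
    """Shannon entropy (in bits) of the bases seen at one pileup column."""
    counts = Counter(bases)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def messy_columns(columns, threshold=0.5):
    """Indices of columns whose entropy exceeds a (hypothetical) threshold."""
    return [i for i, col in enumerate(columns) if column_entropy(col) > threshold]


# Toy pileup: each string holds the bases that the reads show at one position.
columns = ["AAAA", "AACC", "TTTT", "ACGT"]
flagged = messy_columns(columns)
print(flagged)
```

Columns where every read agrees have zero entropy. Columns where reads disagree score higher and are the ones flagged as candidates for a hidden indel.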


How does the realignment algorithm work?

1. Find the best alternate consensus sequence that, together with the reference, best fits the reads in a pile (maximum of 1 indel)

[Figure: the same read pile shown two ways. Against the reference (Ref: AAGAGTAG), the read pile is consistent with the reference sequence apart from three adjacent SNPs. Against the gapped reference (AAG---AGTAG), the read pile is consistent with a 3bp insertion. Realigning determines which is better.]

2. Score for alternate consensus = total sum of quality scores of mismatching bases

3. If best alternate consensus is sufficiently better than the original alignments (using LOD score threshold) -> accept proposed realignment
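The three steps above can be sketched in a few lines. This is a simplified illustration, not the IndelRealigner implementation: reads are placed without gaps at their best offset, the sequences and qualities are toy data, and the division by 10 and the threshold of 5.0 are assumptions chosen for the example.

```python
def mismatch_quality_sum(read, quals, target, offset):
    """Sum of base qualities at positions where the read disagrees with target."""
    return sum(q for i, (b, q) in enumerate(zip(read, quals))
               if b != target[offset + i])


def best_placement_score(read, quals, target):
    """Lowest mismatch-quality sum over all ungapped placements of the read."""
    return min(mismatch_quality_sum(read, quals, target, off)
               for off in range(len(target) - len(read) + 1))


def consensus_score(reads, target):
    """Step 2: total mismatch-quality sum of the whole pile against one sequence."""
    return sum(best_placement_score(seq, quals, target) for seq, quals in reads)


def accept_realignment(reads, reference, alternate, lod_threshold=5.0):
    """Step 3: accept if the alternate consensus is sufficiently better.

    The /10 scaling and the default threshold are assumptions for this sketch.
    """
    improvement = (consensus_score(reads, reference)
                   - consensus_score(reads, alternate)) / 10.0
    return improvement >= lod_threshold


reference = "CTAAGAGTAGGT"
alternate = "CTAAGCCCAGTAGGT"  # reference plus a hypothetical 3 bp insertion "CCC"

# Toy read pile sampled from the alternate sequence, all bases at quality 30.
pile = [(seq, [30] * len(seq)) for seq in ("CTAAGCCC", "AGCCCAGT", "CAGTAGGT")]
# Control pile that matches the reference.
control = [("CTAAGAGT", [30] * 8)]

print(accept_realignment(pile, reference, alternate))
print(accept_realignment(control, reference, alternate))
```

The first pile fits the alternate consensus with no mismatches, so the proposed realignment is accepted. The control pile already fits the reference, so it is left alone.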
Indel Realignment steps/tools

•  Identify what regions need to be realigned

➔  RealignerTargetCreator

•  Perform the actual realignment

➔  IndelRealigner

RealignerTargetCreator

Input BAM + Known Sites -> RealignerTargetCreator -> Target Intervals -> IndelRealigner -> Realigned BAM

•  Pre-processing step to find intervals that may need realignment

•  Input BAM file not necessary if processing only at known indels

•  Using a list of known indels will both speed up processing and improve accuracy, but is not required

    java -jar GenomeAnalysisTK.jar \
        -T RealignerTargetCreator \
        -R human.fasta \
        -I original.bam \
        -known indels.vcf \
        -o realigner.intervals
IndelRealigner

Input BAM + Known Sites + Target Intervals -> IndelRealigner -> Realigned BAM

•  Attempts realignment at RealignerTargetCreator target intervals
•  Must use same input file(s) used in RealignerTargetCreator step
•  Processing options
-  Only at known indels: much faster, accurate for ~90-95% of indels
-  At indels seen in the original BAM alignments: the recommended mode
-  Using full Smith-Waterman realignment: most accurate, but heavy computational cost and not really necessary with the new techs

    java -jar GenomeAnalysisTK.jar \
        -T IndelRealigner \
        -R human.fasta \
        -I original.bam \
        -known indels.vcf \
        -targetIntervals realigner.intervals \
        -o realigned.bam
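Because the two tools must be given the same input files, it can help to build both command lines from one set of paths. The sketch below only assembles argument lists, using the flags shown on the slides. The function name is hypothetical, the file names are the placeholders from the slides, and nothing is executed.

```python
def build_realignment_commands(reference, bam, known, intervals, realigned_bam,
                               jar="GenomeAnalysisTK.jar"):
    """Return (RealignerTargetCreator, IndelRealigner) argument lists.

    Both commands share the same reference, input BAM and known-sites file.
    """
    target_creator = [
        "java", "-jar", jar,
        "-T", "RealignerTargetCreator",
        "-R", reference,
        "-I", bam,
        "-known", known,
        "-o", intervals,
    ]
    indel_realigner = [
        "java", "-jar", jar,
        "-T", "IndelRealigner",
        "-R", reference,
        "-I", bam,
        "-known", known,
        "-targetIntervals", intervals,
        "-o", realigned_bam,
    ]
    return target_creator, indel_realigner


step1, step2 = build_realignment_commands(
    "human.fasta", "original.bam", "indels.vcf",
    "realigner.intervals", "realigned.bam",
)
print(" ".join(step1))
print(" ".join(step2))
```

To actually run the steps, each list could be passed in order to `subprocess.run(cmd, check=True)`, assuming Java and the GATK jar are available locally.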
This is what a realigned BAM looks like

[Figure: alignments Before and After realignment, for old data (lower quality) and new data (higher quality)]

DePristo, M., Banks, E., Poplin, R. et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
Can I see the effects of realignment?

•  Indel Realigner changes the CIGAR string of realigned reads but maintains the original CIGAR (with OC tag)

-> Can grep for realigned regions and view in genome browser (IGV)
20GAVAAXX100126:1:67:10041:180738 99 20 10011431 70 87M1D14M = 10011720 390

TTAAATGTGTTTATCTATTGTTCTACTATTCAGTTACCTGATTATAAAATCAAAGATTATTTCATGAAACTCAGTACCCCTTCAGGGAAAAAAAAA
AAAAT

HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
HHHHHHHHHGGGGGGGG X0:i:1 X1:i:0 MC:Z:101M OC:Z:101M PG:Z:MarkDuplicates RG:Z:20GAV.1 XG:i:0 AM:i:37
NM:i:1 SM:i:37 XM:i:1 XO:i:0

BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@cccddc``a`^\[Y MQ:i:60 XT:A:
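As an alternative to grep, the OC tag can be pulled out of a SAM record with a few lines of code. The sketch below assumes plain-text, tab-separated SAM lines, where optional tags start at the twelfth column. The function name and the shortened example record are hypothetical.

```python
def original_cigar(sam_line):
    """Return the value of the OC:Z: tag, or None if the read has no OC tag."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags follow the 11 mandatory SAM fields
        if tag.startswith("OC:Z:"):
            return tag[len("OC:Z:"):]
    return None


# Hypothetical, shortened record: current CIGAR 4M1D4M, original CIGAR 8M.
realigned = "\t".join([
    "read1", "99", "20", "10011431", "70", "4M1D4M", "=", "10011720", "390",
    "ACGTACGT", "HHHHHHHH", "OC:Z:8M", "NM:i:1",
])
untouched = "\t".join([
    "read2", "99", "20", "10011500", "60", "8M", "=", "10011800", "308",
    "ACGTACGT", "HHHHHHHH", "NM:i:0",
])

print(original_cigar(realigned))
print(original_cigar(untouched))
```

Reads that return a value were realigned. Comparing that value with the CIGAR in column 6 shows what the realignment changed.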
Is realignment still necessary with latest software?

•  Variant callers with reassembly step (HaplotypeCaller, MuTect 2, Platypus) do not require indel realignment

•  BUT potential improvement for Base Quality Score Recalibration when run on realigned BAM files (artifactual SNPs are replaced with real indels).

•  Also still useful for legacy tools
-  Unified Genotyper
-  MuTect 1

Further reading
http://www.broadinstitute.org/gatk/guide/best-practices

http://www.broadinstitute.org/gatk/guide/article?id=38

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php