

Variant evaluation

Tools and methods for handling and comparing variant callsets

We are here in the Best Practices workflow: Variant evaluation

HANDLING CALLSETS
TOOL TIPS
Create subsets with SelectVariants

Inputs
-V               Original VCF file
-select          A selection expression
-sn/-sf/-se      Sample selection by (name/names in a file/expression)

Other parameters of interest
-ef              Exclude filtered sites
-env             Exclude sites that are monomorphic after sample subsetting
-xl_{sn/sf/se}   Exclude (rather than include) samples
-selectType      Include only variants of type X (SNP, InDel, etc.)

Outputs
-o               The resulting VCF after the selection criteria are applied
Example: How many variants do I have in a cohort?
Task: Subset a VCF to a specific group of samples
Tool: SelectVariants
Data: Genotype VCF, sample list (file)

java -jar GATK.jar \
    -R human_g1k_v37.fasta \
    -T SelectVariants \
    -V 1000G_ALL_Genotypes.vcf.gz \
    --sample_file 1000G_EUR_samples.list \
    --excludeFiltered \
    --excludeNonVariants \
    -o 1000G_EUR_Genotypes.bcf

--excludeFiltered: only PASS and "." sites are output.
--excludeNonVariants: sites not variant for 1000G_EUR_samples are removed.

Result: VCF consisting only of non-QC-fail variants for which at least one of the samples in 1000G_EUR_samples.list has a non-reference genotype. The sites can be easily counted.
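Counting the sites can be sketched in a few lines of Python. This is a minimal illustration, not part of GATK: the function names are made up here, and it assumes the output was written as plain-text or gzip-compressed VCF (a binary .bcf file, as in the command above, cannot be read this way).

```python
import gzip


def count_vcf_records(lines):
    """Count data records: every non-blank line that is not a '#' header line."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_vcf_file(path):
    """Count records in a text VCF file, transparently handling .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_vcf_records(handle)
```

For example, count_vcf_file("subset.vcf") would return the number of variant sites in a hypothetical file subset.vcf.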
TOOL TIPS
Merge callsets with CombineVariants

Inputs
-V (multiple)               The VCFs to combine

Other parameters of interest
-genotypeMergeOptions       How to combine different input genotypes for the same sample
-priority                   If genotypes are prioritized, the priority order of the input VCFs
-filteredRecordsMergeType   How to combine different input sites with different filter status: filter if all inputs filtered, or if at least one filtered
-filteredAreUncalled        Treat sites filtered in an input VCF as though they were not present

Outputs
-o                          A VCF with the sites, samples, and genotypes resulting from merging all of the input VCF files given
How many variants are present in both of my cohorts?
Task: Calculate which variants are present in both callset #1 and callset #2
Tool: CombineVariants
Data: Site or Genotype VCFs for each cohort

java -jar GATK.jar \
    -R human_g1k_v37.fasta \
    -T CombineVariants \
    -V:COHORT1 1000G_EUR_Genotypes.bcf \
    -V:COHORT2 testCallSet.chr20.vcf \
    --filteredAreUncalled \
    -o testCallSet.EURCombined.chr20.vcf \
    -L 20

--filteredAreUncalled: pretend QC-fail sites aren't even there.

Result: VCF consisting of all called, not-filtered sites and samples from both input cohorts. In addition, each site is labeled with a set= key (in the INFO field) describing which input set the site is from (COHORT1, COHORT2, or Intersection). Without --filteredAreUncalled, filter information would be captured (e.g. set=COHORT1-filterInCOHORT2) as well.
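To answer the question in the slide title, the set= values in the combined VCF can be tallied. The sketch below is an illustration under stated assumptions: it reads plain-text VCF lines with the standard eight fixed columns, and the helper names info_value and count_by_set are invented for this example.

```python
from collections import Counter


def info_value(info, key):
    """Return the value of key in a semicolon-separated INFO string, or None."""
    for entry in info.split(";"):
        name, sep, value = entry.partition("=")
        if sep and name == key:
            return value
    return None


def count_by_set(lines):
    """Tally records by the value of the set= key that CombineVariants adds."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        counts[info_value(fields[7], "set")] += 1  # INFO is the 8th column
    return counts
```

The count stored under "Intersection" is then the number of sites present in both cohorts.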
How many variants are (likely) private to my sample?
Task: Extract private variants from a callset
Tool: CombineVariants, SelectVariants
Data: Site VCFs for callset, dbSNP, and (as many as possible) external callsets
Note: this is a real question, the results of which were presented at ASHG

java -jar GATK.jar -R human_g1k_v37.fasta \
    -T CombineVariants \
    -V:ESP ESP6500.chr1.snps.vcf \
    -V:T2D private.chr1.seq.imp.sites.vcf \
    -V:1000G 1000G_ALL_Sites.chr1.vcf \
    -o private.chr1.seq.imp.sites.COMBINED.vcf &&
java -jar GATK.jar -R human_g1k_v37.fasta \
    -T SelectVariants \
    -V private.chr1.seq.imp.sites.COMBINED.vcf \
    -select "set == 'T2D'" \
    -o private.chr1.seq.imp.sites.PRIVATE.vcf

Step 1: Generate a new VCF consisting of all sites in ESP, T2D, and 1000G. CombineVariants adds to the INFO field for each variant the input set from which it came, e.g. set=ESP-T2D, set=Intersection, set=T2D, etc.

Step 2: Take the VCF from step 1, and extract only those sites with "set=T2D" (i.e. sites that came only from the T2D call set). These sites are present in T2D samples, but not in 7,500 other samples (from 1000G and ESP).

Result: VCF consisting only of those variant sites in the "T2D" call set that are not present in 1000G or the 6,500 samples in ESP.
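The selection step depends on an exact match of the set value: a site labeled set=ESP-T2D contains "T2D" but is not private. A small Python sketch of that exact-match filter follows; it is an illustration only, the function name select_private is hypothetical, and it assumes plain-text VCF lines.

```python
def select_private(lines, callset="T2D"):
    """Yield header lines plus records whose INFO set= value equals callset exactly."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep the header so the output is still a VCF
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        info = dict(e.split("=", 1) for e in fields[7].split(";") if "=" in e)
        if info.get("set") == callset:  # exact match, not substring
            yield line
```

A substring test such as "T2D" in info["set"] would wrongly keep sites shared with ESP or 1000G, which is why equality is used here.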

EVALUATING A CALLSET
Variant callset evaluation metrics
Planning an evaluation

Does my callset match my expectations given the region and number of samples sequenced?

Strive to tailor analysis to the project
My callset: 62 samples from N Europe, WGS, unknown technology and informatics pipeline

Comparing to all of 1000G (which contains African samples) is not appropriate
Compare to 62 1000G FIN samples

Not capture data, so there are no targets to specify
But, for speed, restrict analysis to chr20
TOOL TIPS
VariantEval

Inputs
-eval (multiple)   The call set(s) to be evaluated
-comp (multiple)   The call set(s) to use as comparators
-D                 dbSNP track

Other parameters of interest
-EV (multiple)     Additional evaluation module(s) to use
-ST (multiple)     Additional stratification(s) to use

Outputs
-o                 A GATKReport text file containing tables of evaluation results
Output of VariantEval

Consists of a sequence of GATKReport tables.

[Figure: annotated example of a GATKReport table. Labels: parsing format; table name and description; table name; stratifications (eval set, comp set, select expressions, novelty); tabulated results (counts in this case).]
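Because the report is a sequence of text tables, it can be loaded with a short parser. The sketch below rests on assumptions about the layout that the slide only hints at: that metadata lines start with "#:", that a "#:GATKTable:" line ending in ";" carries the parsing format while the next one carries the table name and description, that columns are whitespace-separated, and that a blank line ends a table. The function name parse_gatk_report is invented for this example.

```python
def parse_gatk_report(text):
    """Parse GATKReport-style text into {table_name: [row_dict, ...]}.

    Assumed layout (see the lead-in): '#:GATKTable:<format>;' line, then
    '#:GATKTable:<name>:<description>' line, then a header row, then data rows.
    """
    tables = {}
    name, header = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            name, header = None, None  # a blank line closes the current table
        elif line.startswith("#:GATKTable:"):
            if not line.endswith(";"):  # the name/description line
                name = line.split(":")[2]
                tables[name] = []
                header = None
        elif line.startswith("#"):
            continue  # other metadata, e.g. the report version line
        elif name is not None:
            cells = line.split()
            if header is None:
                header = cells  # first row after the name line is the header
            else:
                tables[name].append(dict(zip(header, cells)))
    return tables
```

All values come back as strings; converting counts to integers is left to the caller because the column types depend on the table.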
Do I have a good callset?
Task: Evaluate the quality of a callset
Tool: VariantEval
Data: Genotype VCFs for evaluation and comparison

java -jar GATK.jar \
    -R human_g1k_v37.fasta -L 20 \
    -T VariantEval \
    -eval:myCalls testCallSet.chr20.vcf \
    -eval:1000G_ALL_SITES 1000G_ALL_Sites.vcf.gz \
    -eval:1000G_FIN 1000G_62_FIN_Genotypes.bcf \
    -comp:OMNI_POLY_SITES omni_poly.vcf \
    -comp:OMNI_MONO_SITES omni_mono.vcf \
    -comp:1000G_ALL_SITES 1000G_ALL_Sites.vcf.gz \
    -D dbsnp_137_129_sites.vcf \
    -o myCalls_alongside_1000G.eval

Result: Evaluation report consisting of the standard evaluation modules (CountVariants, CompOverlap, TiTv, etc.) stratified by the standard stratifiers (Novelty, Filter), run on the calls of interest, with 1000G_FIN for comparison.
myCalls: the callset I want to evaluate.
1000G_ALL_SITES (1000 Genomes sites): good estimates for expected Ti/Tv and indel size distribution.
1000G_FIN (size-matched northern European samples): good expectation for number of calls, novelty, etc.
OMNI_POLY_SITES and OMNI_MONO_SITES: polymorphic and monomorphic sites on the 1000G genotype chip (proxy for sensitivity, specificity).
dbSNP 129 and 1000G: catalogues of known polymorphic variation.
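The Ti/Tv ratio that VariantEval reports is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base changes). A minimal Python sketch of that calculation is shown here for orientation; it is not GATK's implementation, and the function name titv_ratio is made up for this example.

```python
BASES = set("ACGT")
# A transition swaps two purines (A, G) or two pyrimidines (C, T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def titv_ratio(snps):
    """Return transitions / transversions for (ref, alt) pairs.

    Pairs that are not single-base substitutions (indels, identical alleles,
    non-ACGT symbols) are skipped. Returns None if there are no transversions.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if frozenset((ref, alt)) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None
```

Comparing this ratio between the callset and a reference set such as 1000G_ALL_SITES is the kind of check the annotations above describe.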
Do I have a good callset?

           # SNPs    # SNPs    Known   Novel   # Indels   # Indels   # Omni
           (known)   (novel)   Ti/Tv   Ti/Tv   (known)    (novel)    Poly
myCalls    136,281    72,718   2.41    2.41      9,654     12,805    34,302
FIN (62)   148,866   102,232   2.33    2.41      7,304     12,219    38,093

Overfiltering?
Contamination?
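One way to compare the two rows is the fraction of SNPs that are novel, novel / (known + novel). This is a derived figure computed here from the counts in the table, not a column of the report, and the names below are chosen for this illustration.

```python
def novel_fraction(known, novel):
    """Fraction of variants that are novel: novel / (known + novel)."""
    total = known + novel
    if total == 0:
        raise ValueError("no variants to summarize")
    return novel / total


# (known, novel) SNP counts copied from the table above.
SNP_COUNTS = {
    "myCalls": (136281, 72718),
    "FIN (62)": (148866, 102232),
}

snp_novelty = {name: novel_fraction(k, n) for name, (k, n) in SNP_COUNTS.items()}

for name, fraction in snp_novelty.items():
    print(f"{name}: {fraction:.3f} of SNPs are novel")
```

A lower novel fraction in the test callset than in the matched set is one pattern that would fit the over-filtering question, though the slide does not say which numbers prompted it.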
Do I have a good callset pt. 2: do samples look consistent?
Task: Evaluate the per-sample quality of a callset
Tool: VariantEval
Data: Genotype VCFs for evaluation and comparison

java -jar GATK.jar \
    -R human_g1k_v37.fasta -L 20 \
    -T VariantEval \
    -eval:myCalls testCallSet.chr20.vcf \
    -eval:1000G_FIN 1000G_62_FIN_Genotypes.bcf \
    -comp:1000G_ALL_SITES 1000G_ALL_Sites.vcf.gz \
    -D dbsnp_137_129_sites.vcf \
    -ST Sample -noEV -EV CountVariants \
    -EV TiTvVariantEvaluator -EV CompOverlap \
    -o myCalls_alongside_1000G.bySample.eval

Result: Evaluation report consisting of the requested evaluation modules (CountVariants, TiTvVariantEvaluator, CompOverlap) stratified by the standard stratifiers (Novelty, Filter), and then by sample, so values are reported for every sample. The evaluation report contains an extra stratification column, populated with the sample name.
Novel Variant Density
[Figure: histogram of per-sample novel variant counts for testCalls and FIN. X-axis: number of novel variants (4000 to 12000); y-axis: frequency.]

Per-sample, novel variant counts look very different between the test callset and the matched 1000G comparison set:

Fewer per-sample novel variants on average, with a heavy left-skew (some samples have very few novel variants).

Could be indicative of over-filtering or low depth.

Other metrics line up well, so calls are probably okay, but a bit less sensitive than the gold standard.
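Samples in the low tail of such a distribution can be listed programmatically once per-sample counts are available. The sketch below is one possible approach and not something the slides prescribe: it flags test samples whose novel count falls more than k sample standard deviations below the mean of the comparison set. The function name, the default k = 2, and any counts passed in are assumptions for illustration.

```python
from statistics import mean, stdev


def flag_low_samples(test_counts, reference_counts, k=2.0):
    """Return sorted names of test samples with unusually few novel variants.

    test_counts: dict mapping sample name to novel variant count.
    reference_counts: per-sample counts from the comparison set (at least two).
    A sample is flagged when its count is below mean(ref) - k * stdev(ref).
    """
    cutoff = mean(reference_counts) - k * stdev(reference_counts)
    return sorted(name for name, n in test_counts.items() if n < cutoff)
```

Flagged samples are candidates for a closer look at depth and filtering, not automatic exclusions.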



PHENOTYPIC INFERENCE
Options for QC through phenotypic inference

Kinship
-> degree of relation between samples (King / PLINK)

Pedigree
-> reconstruct family structure (trios)

Sex
-> coverage / clustering analysis over X and Y
Many projects discard samples with non-standard sex genotypes (e.g. X0, XXY)

Ethnicity inference
-> PCA + clustering on subset of conserved sites (S. Purcell)

These methods developed for GWAS can be used for QC purposes, e.g. to check
identity and verify supplied metadata, as well as adjust variant QC expectations
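As an illustration of the sex check listed above, coverage over X and Y can be compared with autosomal coverage. The sketch below assumes the depth ratios have already been computed (mean depth on X or Y divided by mean autosomal depth), so that roughly 1.0 on X suggests two copies and roughly 0.5 suggests one. The cutoffs 0.75 and 0.25 and the function name are hypothetical and would need calibration on real data.

```python
def infer_sex_karyotype(x_ratio, y_ratio, x_cut=0.75, y_cut=0.25):
    """Guess a sex-chromosome karyotype from normalized X and Y coverage.

    x_ratio, y_ratio: mean depth on X / Y divided by mean autosomal depth.
    x_cut, y_cut: illustrative thresholds (assumptions, not published values).
    """
    two_x = x_ratio >= x_cut  # about 1.0 expected with two X copies
    has_y = y_ratio >= y_cut  # about 0.5 expected with one Y copy
    if two_x and has_y:
        return "XXY"
    if two_x:
        return "XX"
    if has_y:
        return "XY"
    return "X0"
```

Comparing the inferred label with the supplied metadata is then a simple identity check, and X0 or XXY calls mark the samples that many projects set aside.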
Pairwise kinship inference (King / PLINK)
[Figure: pairwise kinship plot with clusters labeled Duplicates, Parent-Offspring, and Siblings. Credit: Monkol Lek, 2014]
Ethnicity affects many variant call metrics
[Figure: SNP count (y-axis ticks at 18K, 21K, 24K) by cohort: 1000G, ATV, Bipolar, BUP, ESP, Ottawa, NFBC, SCZ, T2D-GENES, GoT2D, SIGMA, TAT. Credit: Monkol Lek, 2014]
Older populations tend to display more heterogeneity

Further reading
http://www.broadinstitute.org/gatk/guide/best-practices

http://www.broadinstitute.org/gatk/guide/article?id=51

http://www.broadinstitute.org/gatk/guide/article?id=48

http://www.broadinstitute.org/gatk/guide/article?id=53

http://www.broadinstitute.org/gatk/guide/article?id=54

http://www.broadinstitute.org/gatk/gatkdocs/#VariantEvaluationandManipulationTools
