GATKwr8 V 3 Evaluating Variants

Variant evaluation
Tools and methods for handling and comparing variant callsets

We are here in the Best Practices workflow: Variant evaluation

HANDLING CALLSETS
TOOL TIPS
Create subsets with SelectVariants
Inputs
-V  Original VCF file
-select  A selection expression
-sn/-sf/-se  Sample selection by name / names in a file / expression
Other parameters of interest
-ef  Exclude filtered sites
-env  Exclude sites that are monomorphic after sample subsetting
-xl_{sn/sf/se}  Exclude (rather than include) samples
-selectType  Include only variants of type X (SNP, INDEL, etc.)
Outputs
-o  The resulting VCF after the selection criteria are applied
Example: How many variants do I have in a cohort?
Task: Subset a VCF to a specific group of samples
Tool: SelectVariants
Data: Genotype VCF, sample list (file)

java -jar GATK.jar \
-R human_g1k_v37.fasta \
-T SelectVariants \
-V 1000G_ALL_Genotypes.vcf.gz \
--sample_file 1000G_EUR_samples.list \
--excludeFiltered \
--excludeNonVariants \
-o 1000G_EUR_Genotypes.bcf

--excludeFiltered: only PASS and "." sites are output.
--excludeNonVariants: sites not variant for the 1000G_EUR_samples are removed.

Result: a VCF consisting only of non-QC-fail variants for which at least one of the samples in 1000G_EUR_samples.list has a non-reference genotype. The sites can be easily counted.
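Counting the sites in the subset can be sketched in a few lines of Python. This is a minimal illustration, not part of GATK: it assumes the output was written as a text VCF (plain or gzip-compressed) rather than the binary .bcf used in the command above, and the function names are hypothetical.

```python
import gzip


def count_records(lines):
    """Count VCF data lines, skipping headers (lines starting with '#') and blanks."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_vcf_records(path):
    """Count records in a text VCF; paths ending in .gz are read with gzip."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_records(handle)


if __name__ == "__main__":
    # Toy input for illustration only, not real 1000G data.
    example = [
        "##fileformat=VCFv4.1\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "20\t100\t.\tA\tG\t50\tPASS\t.\n",
        "20\t200\t.\tC\tT\t60\tPASS\t.\n",
    ]
    print(count_records(example))
```

Because --excludeFiltered and --excludeNonVariants were applied upstream, every remaining data line is a site to count, so no further filtering is needed here.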
TOOL TIPS
Merge callsets with CombineVariants
Inputs
-V (multiple)  The VCFs to combine
-genotypeMergeOptions  How to combine different input genotypes for the same sample
-priority  If genotypes are prioritized, the priority order of the input VCFs
-filteredRecordsMergeType  How to combine different input sites with different filter status: filter if all inputs are filtered, or if at least one is filtered
-filteredAreUncalled  Treat sites filtered in an input VCF as though they were not present
Outputs
-o  A VCF with the sites, samples, and genotypes resulting from merging all of the input VCF files given
How many variants are present in both of my cohorts?
Task: Calculate which variants are present in both callset #1 and callset #2
Tool: CombineVariants
Data: Site or Genotype VCFs for each cohort

java -jar GATK.jar \
-R human_g1k_v37.fasta \
-T CombineVariants \
-V:COHORT1 1000G_EUR_Genotypes.bcf \
-V:COHORT2 testCallSet.chr20.vcf \
--filteredAreUncalled \
-o testCallSet.EURCombined.chr20.vcf \
-L 20

--filteredAreUncalled: pretend QC-fail sites aren't even there.

Result: a VCF consisting of all called, not-filtered sites and samples from both input cohorts. In addition, each site is labeled with a set= key (in the INFO field) describing which input set the site is from (COHORT1, COHORT2, or Intersection). Without --filteredAreUncalled, filter information would be captured as well (e.g. set=COHORT1-filterInCOHORT2).
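To answer the question on this slide, the set= values in the combined VCF can be tallied. The sketch below assumes a text VCF with the standard eight fixed columns and set= stored in the INFO column as described above; the helper names are hypothetical and the sample lines are toy data.

```python
from collections import Counter


def info_value(info, key):
    """Return the value of key in a semicolon-separated INFO string, or None."""
    for entry in info.split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            return value
    return None


def tally_sets(lines):
    """Count sites per set= label (e.g. COHORT1, COHORT2, Intersection)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        counts[info_value(fields[7], "set")] += 1
    return counts


if __name__ == "__main__":
    # Toy records for illustration only.
    example = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "20\t100\t.\tA\tG\t.\tPASS\tAC=2;set=Intersection\n",
        "20\t200\t.\tC\tT\t.\tPASS\tset=COHORT1\n",
        "20\t300\t.\tG\tA\t.\tPASS\tset=Intersection\n",
    ]
    for label, n in tally_sets(example).items():
        print(label, n)
```

The count under Intersection is the number of sites present in both cohorts.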
How many variants are (likely) private to my sample?
Task: Extract private variants from a callset
Tool: CombineVariants, SelectVariants
Data: Site VCFs for callset, dbSNP, and (as many as possible) external callsets
Note: this is a real question, the results of which were presented at ASHG

java -jar GATK.jar -R human_g1k_v37.fasta \
-T CombineVariants \
-V:ESP ESP6500.chr1.snps.vcf \
-V:T2D private.chr1.seq.imp.sites.vcf \
-V:1000G 1000G_ALL_Sites.chr1.vcf \
-o private.chr1.seq.imp.sites.COMBINED.vcf &&
java -jar GATK.jar -R human_g1k_v37.fasta \
-T SelectVariants \
-V private.chr1.seq.imp.sites.COMBINED.vcf \
-select "set == 'T2D'" \
-o private.chr1.seq.imp.sites.PRIVATE.vcf

Step 1: Generate a new VCF consisting of all sites in ESP, T2D, and 1000G. CombineVariants adds to the INFO field for each variant the input set from which it came, e.g. set=ESP-T2D, set=Intersection, set=T2D, etc.

Step 2: Take the VCF from step 1, and extract only those sites with "set=T2D" (i.e. sites that came only from the T2D call set). These sites are present in T2D samples, but not in 7,500 other samples (from 1000G and ESP).

Result: a VCF consisting only of those variant sites in the "T2D" call set that are not present in 1000G or the 6,500 samples in ESP.
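Conceptually, the two steps amount to a set difference over site keys. The sketch below illustrates that idea only. It is not how CombineVariants works internally, which also handles multi-allelic records and filter status. Keying sites by (chrom, pos, ref, alt) is an assumption made for this example, and all names and records are hypothetical.

```python
def site_keys(lines):
    """Collect (chrom, pos, ref, alt) keys from text VCF lines."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def private_sites(callset, *others):
    """Sites in callset that appear in none of the other site sets."""
    return callset.difference(*others)


if __name__ == "__main__":
    # Toy site lists standing in for the T2D, ESP, and 1000G site VCFs.
    t2d = site_keys(["1\t100\t.\tA\tG\n", "1\t200\t.\tC\tT\n", "1\t300\t.\tG\tA\n"])
    esp = site_keys(["1\t100\t.\tA\tG\n"])
    kg = site_keys(["1\t300\t.\tG\tA\n", "1\t400\t.\tT\tC\n"])
    print(sorted(private_sites(t2d, esp, kg)))
```

Using the GATK tools instead of a script like this keeps the VCF headers and annotations intact, which is why the slide combines first and selects on set afterwards.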

EVALUATING A CALLSET
Variant callset evaluation metrics
Planning an evaluation
Does my callset match my expectations given the region and number of samples sequenced?
Strive to tailor analysis to the project
My callset: 62 samples from N Europe, WGS, unknown technology and informatics pipeline
Comparing to all of 1000G (which contains African samples) is not appropriate
Compare to 62 1000G FIN samples
Not capture data, so there are no targets to specify
But, for speed, restrict analysis to chr20
TOOL TIPS
VariantEval
Inputs
-eval (multiple)  The call set(s) to be evaluated
-comp (multiple)  The call set(s) to use as comparators
-D  dbSNP track
-EV (multiple)  Additional evaluation module(s) to use
-ST (multiple)  Additional stratification(s) to use
Outputs
-o  A GATKReport text file containing tables of evaluation results
Output of VariantEval
Consists of a sequence of GATKReport tables, each made up of:
Parsing format
Table name and description
Table name
Stratifications: eval set, comp set, select expressions, novelty
Tabulated results (counts in this case)
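A report with this layout can be read into Python with a small parser. The sketch below assumes the GATKReport v1.x text layout: a "#:GATKTable:" line with the column formats, a second "#:GATKTable:" line with the table name and description, a whitespace-separated header row, then data rows ending at a blank line. It also assumes no cell contains spaces. The function name and the sample report are hypothetical.

```python
def parse_gatk_report(text):
    """Parse GATKReport text into {table name: {"description": ..., "rows": [...]}}."""
    tables = {}
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith("#:GATKTable:"):
            continue
        # The line just read holds the parsing format; the next holds name and description.
        _, _, name, description = next(lines).split(":", 3)
        header = next(lines).split()
        rows = []
        for row in lines:
            if not row.strip():
                break  # a blank line ends the table
            rows.append(dict(zip(header, row.split())))
        tables[name] = {"description": description, "rows": rows}
    return tables


if __name__ == "__main__":
    # Toy report for illustration; the counts are made up.
    sample = "\n".join([
        "#:GATKReport.v1.1:1",
        "#:GATKTable:4:2:%s:%s:%s:%d:;",
        "#:GATKTable:CountVariants:Counts different classes of variants in the sample",
        "CountVariants  EvalRod  Novelty  nSNPs",
        "CountVariants  myCalls  known    10",
        "CountVariants  myCalls  novel    5",
        "",
    ])
    for row in parse_gatk_report(sample)["CountVariants"]["rows"]:
        print(row["EvalRod"], row["Novelty"], row["nSNPs"])
```

Values are kept as strings; converting them according to the format line is left out to keep the sketch short.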
Do I have a good callset?
Task: Evaluate the quality of a callset
Tool: VariantEval
Data: Genotype VCFs for evaluation and comparison

java -jar GATK.jar \
-R human_g1k_v37.fasta -L 20 \
-T VariantEval \
-eval:myCalls testCallSet.chr20.vcf \
-eval:1000G_ALL_SITES 1000G_ALL_Sites.vcf.gz \
-eval:1000G_FIN 1000G_62_FIN_Genotypes.bcf \
-comp:OMNI_POLY_SITES omni_poly.vcf \
-comp:OMNI_MONO_SITES omni_mono.vcf \
-comp:1000G_ALL_SITES 1000G_ALL_Sites.vcf.gz \
-D dbsnp_137_129_sites.vcf \
-o myCalls_alongside_1000G.eval

Result: an evaluation report consisting of the standard evaluation modules (CountVariants, CompOverlap, TiTv, etc.) stratified by the standard stratifiers (Novelty, Filter), run on the calls of interest, and on 1000G_FIN for comparison.
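The Ti/Tv metric reported by the TiTv module is the ratio of transition SNPs (A<->G, C<->T) to transversion SNPs (all other base changes). A minimal sketch of that calculation follows. It is an illustration of the definition, not GATK's implementation, and the function name is hypothetical.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def titv_ratio(snps):
    """Compute transitions / transversions from an iterable of (ref, alt) pairs.

    Pairs that are not single-base A/C/G/T substitutions are ignored.
    Returns float('inf') if there are transitions but no transversions,
    and 0.0 if there are no SNPs at all.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        return float("inf") if ti else 0.0
    return ti / tv


if __name__ == "__main__":
    # Toy SNP list: two transitions and one transversion; the indel is skipped.
    print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]))
```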
Tool: VariantEval
What each input contributes:
-eval:myCalls  Callset I want to evaluate
-eval:1000G_ALL_SITES  1000 Genomes sites: good estimates for expected Ti/Tv and indel size distribution
-eval:1000G_FIN  Size-matched northern European samples: good expectation for number of calls, novelty, etc.
-comp:OMNI_POLY_SITES, -comp:OMNI_MONO_SITES  Polymorphic and monomorphic sites on the 1000G genotype chip (proxy for sensitivity, specificity)
-D, -comp:1000G_ALL_SITES  dbSNP 129 and 1000G as catalogues of known polymorphic variation
Tool: VariantEval

Callset    # SNPs (known)  # SNPs (novel)  Known Ti/Tv  Novel Ti/Tv  # Indels (known)  # Indels (novel)  # Omni Poly
myCalls    136,281         72,718          2.41         2.41         9,654             12,805            34,302
FIN (62)   148,866         102,232         2.33         2.41         7,304             12,219            38,093
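One way to compare the two rows is to turn the known and novel counts into a novelty rate, novel / (known + novel). The sketch below uses the counts from the table above. The novelty rate is a derived quantity added here for illustration and is not a column of the report; the names are hypothetical.

```python
def novelty_rate(known, novel):
    """Fraction of variants that are novel: novel / (known + novel)."""
    return novel / (known + novel)


# Counts copied from the table above.
COUNTS = {
    "myCalls": {"snps_known": 136281, "snps_novel": 72718,
                "indels_known": 9654, "indels_novel": 12805},
    "FIN (62)": {"snps_known": 148866, "snps_novel": 102232,
                 "indels_known": 7304, "indels_novel": 12219},
}

if __name__ == "__main__":
    for name, c in COUNTS.items():
        snp_rate = novelty_rate(c["snps_known"], c["snps_novel"])
        indel_rate = novelty_rate(c["indels_known"], c["indels_novel"])
        print(f"{name}: SNP novelty {snp_rate:.3f}, indel novelty {indel_rate:.3f}")
```

A noticeably lower SNP novelty rate in the test callset than in the matched comparison set is the kind of difference that prompts a closer look at filtering.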
Questions raised by comparing myCalls with FIN (62) in the table above:
Overfiltering?
Contamination?
Do I have a good callset pt. 2: do samples look consistent?
Task: Evaluate the per-sample quality of a callset
Tool: VariantEval

java -jar GATK.jar \
-T VariantEval \
-ST Sample -noEV -EV CountVariants \
-EV TiTvVariantEvaluator -EV CompOverlap \
-o myCalls_alongside_1000G.bySample.eval

(Input arguments are not shown here; see the previous VariantEval command.)

Result: an evaluation report consisting of the standard evaluation modules (CountVariants, CompOverlap, TiTv, etc.) stratified by the standard stratifiers (Novelty, Filter), and then by sample, so values are reported for every sample. The evaluation report contains an extra stratification column, populated with the sample name.
[Figure: "Novel Variant Density" histogram. x-axis: number of novel variants per sample (4,000 to 12,000); y-axis: frequency; series: testCalls, FIN]

Per-sample novel variant counts look very different between the test callset and the matched 1000G comparison set: fewer per-sample novel variants on average, with a heavy left skew (some samples have very few novel variants).
Could be indicative of over-filtering or low depth.
Other metrics line up well, so calls are probably okay, but a bit less sensitive than the gold standard.
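Samples in the left tail can be flagged automatically once per-sample novel counts have been pulled from the by-sample report. The sketch below uses a median and median-absolute-deviation rule. That rule, the threshold of 3, and the sample counts are all assumptions chosen for illustration, not something the slides prescribe.

```python
from statistics import median


def low_outliers(counts, n_mads=3.0):
    """Return sample names whose count is more than n_mads MADs below the median.

    counts maps sample name -> number of novel variants.
    """
    values = list(counts.values())
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread, so nothing can be called an outlier
    return sorted(s for s, v in counts.items() if (med - v) / mad > n_mads)


if __name__ == "__main__":
    # Made-up per-sample novel variant counts.
    novel_counts = {"s1": 9000, "s2": 9200, "s3": 8800, "s4": 9100, "s5": 4000}
    print(low_outliers(novel_counts))
```

Flagged samples are candidates for a depth or filtering check rather than automatic removal.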

PHENOTYPIC INFERENCE
Options for QC through phenotypic inference
Kinship
-> degree of relation between samples (King / PLINK)
Pedigree
-> reconstruct family structure (trios)
Sex
-> coverage / clustering analysis over X and Y
Many projects discard samples with non-standard sex genotypes (e.g. X0, XXY)
Ethnicity inference
-> PCA + clustering on subset of conserved sites (S. Purcell)
These methods, developed for GWAS, can be used for QC purposes, e.g. to check identity and verify supplied metadata, as well as to adjust variant QC expectations.
Pairwise kinship inference (King / PLINK)
[Figure: pairwise kinship plot with clusters labeled Duplicates, Parent-Offspring, and Siblings. Credit: Monkol Lek, 2014]

Ethnicity affects many variant call metrics
[Figure: SNP count (axis ticks 18K, 21K, 24K) by cohort: 1000G, ATV, Bipolar, BUP, ESP, Ottawa, NFBC, SCZ, T2D-GENES, GoT2D, SIGMA, TAT. Credit: Monkol Lek, 2014]
Older populations tend to display more heterogeneity
Further reading
http://www.broadinstitute.org/gatk/guide/best-practices
http://www.broadinstitute.org/gatk/guide/article?id=51
http://www.broadinstitute.org/gatk/gatkdocs/#VariantEvaluationandManipulationTools
