The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# What VQSR training sets / arguments should I use for my specific project?

Posts: 122GATK Developer mod
edited September 22 in FAQs

This document describes the resource datasets and arguments that we recommend for use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes, to comply with the GATK Best Practices. The recommendations detailed in this document take precedence over any others you may see elsewhere in our documentation (e.g. in Tutorial articles, which are only meant to illustrate usage, or in past presentations, which may be out of date).

The document covers:

• Explanation of resource datasets
• Important notes about exome experiments
• Argument recommendations for VariantRecalibrator
• Argument recommendations for ApplyRecalibration

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).

### Explanation of resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.

#### Resources for SNPs

• True sites training resource: HapMap
This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

• True sites training resource: Omni
This resource is a set of polymorphic SNP sites produced by the Omni geno- typing array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

• Non-true sites training resource: 1000G
This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this re- source may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (%). 17

• Known sites resource, not used in training: dbSNP
This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

#### Resources for Indels

• Known and true sites training resource: Mills
This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

Some of the annotations included in the recommendations given below might not be the best for your particular dataset. In particular, the following caveats apply:

• Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

• HaplotypeScore is a statistic produced by UnifiedGenotyper which should be used for SNPs if you're calling variants with UG. This statistic isn't produced by the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes with HC.

• The InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples please omit this annotation from the command line.

### Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

• Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs.

• You can also try using the VQSR with the smaller variant callset, but experiment with argument settings (try adding --maxGaussians 4 to your command line, for example). You should only do this if you are working with a non-model organism for which there are no available genomes or exomes that you can use to supplement your own cohort.

### Argument recommendations for VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

#### Common, base command line

This is the first part of the VariantRecalibrator command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

java -Xmx4g -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R path/to/reference/human_g1k_v37.fasta \
-input raw.input.vcf \
-recalFile path/to/output.recal \
-tranchesFile path/to/output.tranches \
-nt 4 \
[SPECIFY TRUTH AND TRAINING SETS] \
[SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
[SPECIFY WHICH CLASS OF VARIATION TO MODEL] \


#### SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.

   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an DP -an InbreedingCoeff \
-mode SNP \


Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration (see below for details of why).

Note also that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

#### Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

   --maxGaussians 4 \
-resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
-an QD -an DP -an FS -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
-mode INDEL \


Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

### Argument recommendations for ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

#### Common, base command line

This is the first part of the ApplyRecalibration command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.


java -Xmx3g -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R reference/human_g1k_v37.fasta \
-input raw.input.vcf \
-tranchesFile path/to/input.tranches \
-recalFile path/to/input.recal \
-o path/to/output.recalibrated.filtered.vcf \
[SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
[SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \


#### SNP specific recommendations

For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.5 \
-mode SNP \


#### Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.0 \
-mode INDEL \

Post edited by Geraldine_VdAuwera on
Tagged:

Geraldine Van der Auwera, PhD

• Posts: 4Member

Thank you very much Geraldine and GATK team for making these good documentation!

In this version here I see argument "-tranche" is not used for command "VariantRecalibrator" anymore. In the previous version it was used as shown at http://www.broadinstitute.org/gatk/guide/article?id=2805

Could you please confirm that for me? Thanks!

• Posts: 4Member

@frankfeng said: Thank you very much Geraldine and GATK team for making these good documentation!

In this version here I see argument "-tranche" is not used for command "VariantRecalibrator" anymore. In the previous version it was used as shown at http://www.broadinstitute.org/gatk/guide/article?id=2805

Could you please confirm that for me? Thanks!

I just checked the documentation for that "-tranche" argument, and saw "The default values are 100.0, 99.9, 99.0, and 90.0 ".

• Posts: 1Member

Dear Geraldine,

I found two conflicting suggestions in this page concerning the resources to be exploited for VQSR:

TUTORIAL -resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \

FAQ -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \

Which one shall we use?

Best,

Lorenzo

The FAQ supersedes the tutorial (all tutorial should be taken as usage examples only). Sorry for the confusion, we'll fix that ASAP.

Geraldine Van der Auwera, PhD

• Posts: 113Member

If I have 2000 samples - among which there are a lot of highly related samples - is it then best practice not to feed the InbreedingCoeff annotation to VR? I have a lot more than 10 founder samples among the 2000 samples.

I am happy to be pointed in the right direction: "See the 1000 Genomes Phase I release for more information." I read the supplementary material to the phase 1 paper, but it did not illuminate me any further.

This is the final question from me in a long time. I promise.

Sparse is a nice way to put it :)

I honestly do not know and will need to put this to our methods people. Can you give a little more detail about the composition of your cohort? I.e. X number of families of N members on average etc., which should allow them to give you some kind of recommendation.

Geraldine Van der Auwera, PhD

• Posts: 113Member
edited September 12

@Geraldine_VdAuwera, these are households with children, parents and grandparents living in the same household. The household size distribution for the 1986 samples is as follows (count, household size):

141 1
87 2
70 3
64 4
35 5
29 6
28 7
22 8
16 9
10 10
8 11
4 12
3 13
1 14
1 15
1 17
1 19


I would expect at least half of the samples to be related. I haven't calculated the IBD value distribution yet to confirm this. Our current decision is to not use the InbreedingCoeff annotation for VR, but I might not have understood exactly how it works. Thank you.

Post edited by tommycarstensen on

@tommycarstensen The word from the methods devs is that your decision to not use InbreedingCoeff for VQSR is correct given the degree of relatedness of your subjects.

Geraldine Van der Auwera, PhD

• Posts: 6Member

The resource bundle contains two versions of dbSNP: one with only variants up through version 129, the other containing all variants through version 138. Is there a recommendation on which of these to use? I'm going to guess that the full 138 version is best given that this resource is given a low prior and seems to used in order to increase sensitivity (thus it should be highly inclusive, even of low quality variants).