
(howto) Recalibrate variant quality scores = run VQSR

Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer
edited December 2013 in Tutorials

Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

Prerequisites

  • TBD

Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document.

Steps

  1. Prepare recalibration parameters for SNPs
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  2. Build the SNP recalibration model

  3. Apply the desired level of recalibration to the SNPs in the call set

  4. Prepare recalibration parameters for Indels
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  5. Build the Indel recalibration model

  6. Apply the desired level of recalibration to the Indels in the call set


1. Prepare recalibration parameters for SNPs

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • True sites training resource: HapMap

This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G

This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90.00%).

  • Known sites resource, not used in training: dbSNP

This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
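
The Phred-scaled priors above convert to probabilities via P(true) = 1 - 10^(-Q/10). As a quick sanity check, this short sketch (plain Python, not part of GATK) reproduces the percentages quoted in this section:

```python
# Convert a Phred-scaled prior Q into the probability that a site is real:
# P(true) = 1 - 10^(-Q/10)

def phred_to_prob(q):
    """Probability corresponding to a Phred-scaled prior Q."""
    return 1.0 - 10.0 ** (-q / 10.0)

# Reproduce the percentages quoted for each resource above
for name, q in [("hapmap", 15), ("omni", 12), ("1000G", 10), ("dbsnp", 2)]:
    print(f"{name}: Q{q} -> {100 * phred_to_prob(q):.2f}%")
# hapmap: Q15 -> 96.84%
# omni: Q12 -> 93.69%
# 1000G: Q10 -> 90.00%
# dbsnp: Q2 -> 36.90%
```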

b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Phred-scaled p-value using Fisher's Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

  • MappingQualityRankSumTest (MQRankSum)

The u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.
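
To make the tranche idea concrete, here is a toy sketch (plain Python, not GATK's actual algorithm, with made-up VQSLOD values): rank the scores of calls that overlap the truth sites, then pick the cutoff that keeps the fraction of truth sites corresponding to the target sensitivity.

```python
def tranche_threshold(truth_vqslods, sensitivity):
    """VQSLOD cutoff that retains `sensitivity` (a fraction) of truth sites."""
    scores = sorted(truth_vqslods, reverse=True)
    n_keep = max(1, round(sensitivity * len(scores)))  # truth sites to retain
    return scores[n_keep - 1]

# Hypothetical VQSLOD scores for ten calls overlapping truth sites
truth = [9.1, 7.4, 6.8, 5.0, 3.2, 1.1, -0.5, -2.0, -4.4, -9.0]
for s in (1.0, 0.999, 0.99, 0.90):
    print(f"{100 * s:5.1f}% tranche -> keep calls with VQSLOD >= {tranche_threshold(truth, s)}")
```

Lowering the target sensitivity raises the cutoff, which is why the 90.0 tranche is the most specific (and least sensitive) slice.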


2. Build the SNP recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \ 
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \ 
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \ 
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \ 
    -an DP \ 
    -an QD \ 
    -an FS \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -mode SNP \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -rscriptFile recalibrate_SNP_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.


3. Apply the desired level of recalibration to the SNPs in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -mode SNP \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -o recalibrated_snps_raw_indels.vcf 

Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and marked either PASS or with a tranche-specific filter label, depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore run less risk of including false positives, at the cost of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%), which provides the highest sensitivity you can get while still being acceptably specific.
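
Conceptually, the filtering this step performs can be sketched as follows (plain Python with hypothetical cutoff values and simplified filter labels; in a real run GATK reads the actual thresholds from the .tranches file): calls scoring within the selected tranche are marked PASS, and worse-scoring calls get a filter label naming the tranche band they fall into.

```python
# Hypothetical VQSLOD cutoffs per tranche, strictest (most specific) first.
# Real .tranches files define these values; the numbers here are made up.
TRANCHE_CUTOFFS = [("99.0", 1.5), ("99.9", -2.0), ("100.0", float("-inf"))]

def filter_status(vqslod, ts_filter_level="99.0"):
    """PASS if the call falls within the selected tranche,
    otherwise a (simplified) label for the tranche band it falls into."""
    names = [name for name, _ in TRANCHE_CUTOFFS]
    pass_rank = names.index(ts_filter_level)
    for rank, (name, cutoff) in enumerate(TRANCHE_CUTOFFS):
        if vqslod >= cutoff:
            return "PASS" if rank <= pass_rank else f"VQSRTranche{name}"
    return "LowQual"  # unreachable here: the 100.0 tranche catches everything

for v in (2.0, 0.0, -5.0):
    print(v, "->", filter_status(v))
```

Raising --ts_filter_level to 99.9 would move the PASS boundary down one band, admitting more true positives along with more false positives.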


4. Prepare recalibration parameters for Indels

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • Known and true sites training resource: Mills

This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage.

  • FisherStrand (FS)

Phred-scaled p-value using Fisher's Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

  • MappingQualityRankSumTest (MQRankSum)

The u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

d. Determine additional model parameters

  • Maximum number of Gaussians (-maxGaussians) 4

This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include a very large number of variants.


5. Build the Indel recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \ 
    -an DP \ 
    -an FS \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -mode INDEL \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    --maxGaussians 4 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -rscriptFile recalibrate_INDEL_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.


6. Apply the desired level of recalibration to the Indels in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -mode INDEL \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -o recalibrated_variants.vcf 

Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the variants from the original recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and marked either PASS or with a tranche-specific filter label, depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills truth training set of Indels. If we wanted to be more specific (and therefore run less risk of including false positives, at the cost of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%), which provides the highest sensitivity you can get while still being acceptably specific.


Geraldine Van der Auwera, PhD

Comments

  • avidLearner Posts: 9, Member

    Hello, I have a few questions that I would like to clarify:

    1. Is the file 1000G_phase1.snps.high_confidence.vcf available in the resource bundle? (I looked it up and it wasn't there!) If not, which file is recommended with the updated VariantRecalibrator parameters?

    2. Just to clarify, are the parameters described above pertinent to whole genomes or whole exomes? (I am assuming whole genomes but this has not been mentioned explicitly anywhere.) I understand that it is recommended to use --maxGaussians 4 --percentBad 0.05 for a small number of exome samples. In this case, is a -minNumBad of 1000 recommended or should a lower value be set?

    3. The VariantRecalibrator documentation parameters (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html) are slightly different from the parameters described here. Is this tutorial the latest best practices guideline that can be followed for VQSR?

    Thanks for your help!

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    Hi there,

    • The filenames are a little different from what we have in the bundle; I will correct that soon (sorry), but basically you want the 1000 Genomes Phase 1 SNPs, which should be in our bundle.

    • These parameters are for WGS. I will add a note to explain that.

    • The parameters given in tutorial documents are not necessarily up-to-date with our Best Practices recommendations. These can be found in the Best Practices document on our website, which consolidates (or points to) the actual parameter recommendations. For VQSR there is a specific FAQ document, which is linked to from the BP document. We will try to make that clearer in the documentation.


  • avidLearner Posts: 9, Member

    Hello Geraldine, I searched for the 1000 Genomes Phase 1 SNPs VCF in ftp://ftp.broadinstitute.org/bundle/2.3/b37/ but no luck! Please tell me if I'm looking in the wrong place.

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    Hey @avidLearner, sorry to get back to you so late. You've been looking in version 2.3 of the bundle, but you should be looking in the latest version, which is 2.5.


  • avidLearner Posts: 9, Member

    Hi Geraldine, yes I figured it out. I was initially accessing ftp://ftp.broadinstitute.org/bundle/ through Internet Explorer and for some reason it wasn't loading the updated version 2.5. But when I accessed the resource bundle through wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/ I was able to locate the latest version. Sorry for the trouble. Thanks for the help!

  • yingzhang (minneapolis) Posts: 4, Member
    edited August 2013

    This is a really useful article on variant recalibration, and I think it makes more sense than the brief overview at http://www.broadinstitute.org/gatk/guide/topic?name=best-practices

    But I do have a question, which might just be me over-thinking it.

    In the overview, you state:

    =======

    We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:

    snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
    indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
    recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
    analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
    

    ========

    The difference between the overview and this FAQ is how one should build the error model for indels.

    In the overview, the indel model is built from raw_variants.vcf, while in this article it is built from recalibrated_snps_raw_indels.vcf.

    I assume there should be no cross-talk between the INDEL and SNP types. But if there is, the indel model built from raw_variants.vcf will differ from the one built from recalibrated_snps_raw_indels.vcf.

    Should I follow the overview or this FAQ?

    And when you say "best results", what else have you tested?

    Best, Ying

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    Hi Ying,

    We're currently rewriting the Best Practices document to make it more readable, so hopefully that will help clarify things. In the meantime, let me try to clear up some of the confusion, which is very common, about how the recalibration works.

    When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool outputs recalibrated variants for that type, and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round.

    This replaces the old way of proceeding, where we recalibrated indels and SNPs in separate files, because it is faster and more efficient.

    Does that make more sense?


  • yingzhang (minneapolis) Posts: 4, Member

    Thank you Geraldine. Your explanation is really clear.


  • avilella Posts: 6, Member

    Hi @Geraldine_VdAuwera,

    Does this two-step process described below also work in version 1.6? (I am using v1.6-22-g3ec78bd)

    "When you specify either INDEL or SNP for the model, the model is built using only the specified type of variant, ignoring the other type. Then, when you apply the recalibration (specifying the appropriate variant type), the tool output recalibrated variants for that type, and emits the other variant type without modification. So if you specified SNPs, your output contains recalibrated SNPs and unmodified indels. You can then repeat the process on that output file, specifying INDEL this time, so that the tool will output recalibrated indels, as well as the already recalibrated SNPs from the previous round."

    Or is 1.6 still using the separate process?

    "old way of proceeding, where we recalibrated indels and SNPs in separate files"

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    With 1.6 you should recalibrate variants separately (the old way).

    FYI version 2.7 has some key improvements to VQSR that make it work better and more consistently. We really don't recommend using older versions.


  • avilella Posts: 6, Member
    edited September 2013

    Can I recalibrate each of the chromosomes as a separate vcf, each time pointing to the whole resource files? Will it make a huge difference compared to merging the chromosome vcfs and recalibrating only once?

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    It's much better to recalibrate all the chromosomes together because it gives the VQSR model more data, which will make it more accurate and reliable. In many cases individual chromosomes don't even have enough variants for recalibration to work at all.


  • prepagam Posts: 18, Member

    Can I check whether VQSR should be done on each population separately, as with UnifiedGenotyper? I have 40 samples but from many populations, so there is probably not enough data for VQSR if it is run per population.

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    Hi @prepagam,

    You should run VQSR jointly only on samples that were called jointly; so if you called your samples in subsets by population, you also have to recalibrate them according to the same subsets.


  • splaisan Posts: 14, Member

    Further in the workflow, I noticed the now-fixed --maxGaussians 4 (not yet fixed in the text of part d), and that the tranche argument followed by a range notation was not accepted on my machine; is this normal?

    # fails even with single or double quotes around []
        -tranche [100.0, 99.9, 99.0, 90.0] \ 
    # works with
        -tranche 100.0 \
        -tranche 99.9 \
        -tranche 99.0 \
        -tranche 90.0 \
    

    thanks

    Stephane

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    Hi @splaisan,

    You are correct, the way you got it to work is the right way to do it. I'll rephrase it in the doc since it seems to be confusing people. Thanks for pointing this out!


  • splaisan Posts: 14, Member

    sad, the bracket notation was cool! ;-)

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    Yep. Maybe if we find time we'll amend how the arguments can be passed in... (but don't hold your breath)


  • splaisan Posts: 14, Member

    Inspired by the broadE 2013 video#5 by Ryan Poplin (maybe a question for Ryan!), great video BTW. Context: I took Illumina NA18507 chr21 mappings (originally to hg18) where most of the bam header is missing and complies poorly with Picard+GATK expectations. I first extracted the reads to FASTQ to remap them to hg19 and then followed the full Best Practices pipeline.

    Q: With such data, should I add one @RG corresponding to each original lane present in the hg18 bam file (therefore exporting each lane to a separate FASTQ) in order to take advantage of the lane bias detection of the recalibration? Or will the effect be minimal and not justify the long run time?

    THX

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    Yes, you should definitely assign RGs per original lane for best recalibration results. It is totally worth the extra effort, because different lanes are likely to suffer from different error modes.

    Keep in mind also that when reverting to FASTQ, it's important to randomize the reads in order to avoid biasing the aligner. I forget if you're the person I recently discussed this with on this forum; if not I recommend you check out our tutorial on reverting bams to Fastq.


  • splaisan Posts: 14, Member

    Hi again, about the -an parameters.

    When the HC has been used without specifying annotation types (as is the case in the separate tutorial), do we always get the 'an' types described above? Is this the full list, or are these best-practice defaults? Is there a rationale for which ones to use in genome sequencing, and what about the mirror question for the UG? Thanks

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    There is a set of core annotations that are used by default for both UG and HC. They typically don't change across versions. They are the "Standard" annotations; you can get a list by running VariantAnnotator with the -list argument, or if you want to check if a specific annotation is in the standard set, you can look it up on its tech doc page, e.g. here toward the top of the page, where it says "Type:". In the current defaults, the only difference between UG and HC is that HC additionally uses ClippingRankSumTest and DepthPerSampleHC.

    The choice of the default set (and any other recommendation we may make in the documentation) is based largely on empirical observations of what annotations are consistently informative and useful for filtering in our work. The caveat is that most of our work is done on specific data types (e.g. well-behaved Illumina exomes) so if you're using different data types you may need to experiment with other annotations. But we try to make our recommendations as broadly applicable as possible.


  • vsvinti Posts: 47, Member

    Hello there, I'm wondering what would be a sensible option for -numBad. You say "The default value is 1000, which is geared for smaller datasets. If you have a dataset that's on the large side, you may need to increase this value considerably."

    I have ~ 1,100 samples, and I set -numBad to 10,000. Would that be reasonable? I am working with whole exome data. Your previous recommendations had "-percentBad 0.01 -minNumBad 1000". Does the new numBad replace both of these? Cheers!

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    Yes, the new -numBad argument replaces the other two.

    The important metric for setting numBad is not really the number of samples, but rather the number of variants initially called.


  • vsvinti Posts: 47, Member

    Right, of course. So for ~ 900,000 variants, would 10,000 suffice for numBad (or how does one determine the 'ideal' number)?

  • Geraldine_VdAuwera Posts: 5,973, Administrator, GATK Developer

    It's hard to say up front because it depends a lot on your data, but I think 10K out of 900K is a good place to start, though I would recommend doing several runs with increasing numbers, and choosing the one that gives the best-looking plots. The good news is we are working on a way to have the recalibrator calculate the optimal number automatically to take the guesswork out of this process...


  • vsvinti Posts: 47, Member

    Hi there, another question on VQSR training sets / options. This page suggests the following:

        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
        -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
        -mode SNP -numBad 1000 \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0

    While the section on "What VQSR training sets / arguments should I use for my specific project?" says:

    ```
    --numBadVariants 1000 \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
    ```

    Both of these pages are on the current best practices page. For omni, the first suggests truth=false, while the latter says truth=true. I assume one of them is wrong - please advise which is correct. Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    Hi @vsvinti,

    Thanks for pointing that out; I understand it is confusing. As for which document is correct, the general rule is that tutorials are only meant to provide usage examples and do not necessarily include the latest Best Practices parameter values. In this case, though, it is actually a mistake in the command line: if you look at the parameter description earlier in the text, Omni is correctly identified as a truth set. I'll fix this in the text.
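    For reference, the corrected Omni resource line (matching the FAQ document) reads:

    ```
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
    ```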

    Geraldine Van der Auwera, PhD

  • vsvintivsvinti Posts: 47Member
    edited November 2013

    Thanks for that. I tend to treat anything I find in the Best Practices as the current recommendations, and since the tutorial link came first, I assumed it was the latest update. I wonder if there's a way to point out in the Best Practices which pages contain the latest option recommendations and which don't; I generally don't read all of the subsections once I think I've found the answer. Thanks.

  • vsvintivsvinti Posts: 47Member

    How about for indels? The tutorial only has Mills (all set to true), while the methods article has:

    ```
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    ```

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    I wonder if there's a way to point out in best practices which pages contain the latest option recommendations, and which don't

    I'll try to think of something to make this clearer.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    The methods article always takes precedence over the tutorial. For the record, the difference it makes is very minor: if you use the dbsnp file with known=true, that's what the recalibrator will use to annotate variants as known or novel; otherwise it will use the Mills set. But this does not play any role in the recalibration model calculations.

    Geraldine Van der Auwera, PhD

  • rrahmanrrahman Posts: 5Member

    Should we add the -numBad parameter to the indel recalibration model as well for whole exome datasets? Or anything else? Thanks in advance.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    Hi @rrahman,

    The -numBad parameter has been removed from VQSR in the latest version (2.8), so unless you're using an older version, you should not try to use this parameter.

    Geraldine Van der Auwera, PhD

  • rrahmanrrahman Posts: 5Member

    So, for the 2.8 version, shall I use the exact recalibration model for both SNPs and indels on whole exome datasets?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    I'm not sure what you mean by "exact recalibration model". Basically, follow the instructions as written in the article above.

    Geraldine Van der Auwera, PhD

  • rrahmanrrahman Posts: 5Member

    I meant the same parameters as shown in the section above. I was a bit confused because in earlier comments you said they were for whole genome datasets. But anyway, thanks for the clarification.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    Oh I see what you mean. Yes, it's essentially the same thing, although for exomes you should specify a list of intervals (which you should already be doing for previous steps as well). And for the ti/tv-based tranche plots to look good, you may need to adjust the expected Ti/Tv ratio parameter.

    Geraldine Van der Auwera, PhD

  • rrahmanrrahman Posts: 5Member

    Yes, I already took care of the intervals in the previous steps. Thanks again for pointing out the expected Ti/Tv ratio parameter.

    Cheers Rubayte

  • SerenaRhieSerenaRhie Posts: 2Member

    On the FTP for resource bundle, I see dbsnp_138.b37.excluding_sites_after_129.vcf.gz and dbsnp_138.b37.vcf.gz. Which one should I use for resource:dbsnp? Are there some cases to use excluding_sites_after_129 and other cases for 138?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    @SerenaRhie,

    You can safely use either as the resource:dbsnp for recalibration, as they will not affect the filtering of your variants. The difference is that if you use the full 138 version, more of your variants are likely to be annotated as "known" (as opposed to "novel"), so if you are doing any meta-analysis that compares numbers of known vs. novel variants, you'll need to keep this effect in mind.

    Geraldine Van der Auwera, PhD

  • sfbarclaysfbarclay Posts: 3Member

    @Geraldine_VdAuwera said: it's essentially the same thing, although for exomes you should specify a list of intervals (which you should already be doing for previous steps as well). And for the ti/tv-based tranche plots to look good, you may need to adjust the expected Ti/Tv ratio parameter.

    I haven't come across this recommendation for working with exomes before (that I should be specifying a list of intervals). Can you please point me to the documentation I should read to better understand? Thank you! Sarah

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    Hi @sfbarclay,

    We don't have documentation specifically on that (though perhaps we should -- I'll put it on the todo list) but here is a brief explanation.

    When you produce exome sequence data, by design the data only covers specific target sequences --for the most part. But there is also some off-target sequence data, which is typically not well mapped, and mostly produces false positive calls. So if you don't restrict your analysis to the regions targeted by the sequence prep, you enrich your callset for false positives. That can mess up filtering and callset evaluation, because QC metrics like titv ratio will be skewed by the false positives.

    Make sense?

    Geraldine Van der Auwera, PhD

  • sfbarclaysfbarclay Posts: 3Member

    Hi Geraldine, Yes, that makes perfect sense, thank you. My variant lists were a lot bigger than I thought they should be, and I'm guessing this is why. So in HaplotypeCaller, I should use the --activeRegionIn argument and specify a bed file corresponding to my target regions. Is that right? And there wouldn't be anything additional to specify in the Variant Recalibration step? Thanks! Sarah

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    @sfbarclay, not quite. The argument for passing intervals is -L. It's an engine argument, so you'll find it in the CommandLineGATK tool documentation. You should also look up the interval padding argument, which is also recommended for best results.

    The --activeRegionIn argument is a specialized argument with a very different purpose (mostly used for troubleshooting).

    You don't have to use -L in subsequent steps, but in some cases it can speed up processing, so it doesn't hurt to keep using it.
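    As a sketch (the file names and padding value here are placeholders, not specific recommendations), the relevant arguments look like this:

    ```
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fasta \
        -I sample.bam \
        -L exome_targets.intervals \
        --interval_padding 100 \
        -o sample.vcf
    ```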

    Geraldine Van der Auwera, PhD

  • ChrisPattersonChrisPatterson Epilepsy Genomics CenterPosts: 7Member

    I'm trying to run the VariantRecalibrator on my Indels. I am using the Mills_and_1000G_gold_standard.indels.hg19.vcf as my resource (downloaded from your research bundle), as is suggested in the Guide. However, the INFO field contains no annotations, so the program isn't able to create the recalibration model. And there are no samples within the vcf just references to the original data set they were taken from, so VariantAnnotator isn't capable of filling in the missing information. Have I misunderstood which resource files I should use for this tool?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    @ChrisPatterson‌ VariantRecalibrator doesn't use sample information, so those resource files are okay to use. Can you check that you got the right version of the file (e.g. hg19 vs b37), to match the reference build version that was used to map your data? Currently the tool does not check whether the reference versions match, and if you are using different builds, the contig names are different, so the tool will complain that it cannot find any overlapping variants with the right annotations. But that's not actually the problem -- the reference mismatch is the problem. We're going to fix that for the next version of course.
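    One quick way to check for a build mismatch (assuming both files carry ##contig header lines) is to compare the contig names in the two headers, e.g. "chr1" in hg19 vs "1" in b37:

    ```
    # Compare contig naming between the resource VCF and your callset
    grep -m 3 '^##contig' Mills_and_1000G_gold_standard.indels.hg19.vcf
    grep -m 3 '^##contig' my_calls.vcf
    ```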

    Geraldine Van der Auwera, PhD

  • ChrisPattersonChrisPatterson Epilepsy Genomics CenterPosts: 7Member

    I had seen your response to an earlier question where they were having the same problem, so checking the build version was one of my first steps.

    I figured out where I made my mistake. I was initially using HaplotypeCaller, which calls both SNPs and indels together, but on our system the processing time for a whole exome was going to be weeks, so I switched over to UnifiedGenotyper, which calls SNPs and indels separately. As a result, the VCF file I was trying to recalibrate for indels didn't actually have any indels in it.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    Ah, that makes sense. Glad you were able to track down the issue.

    Geraldine Van der Auwera, PhD

  • caligiuricaligiuri Posts: 2Member

    Hi everyone! I just started an exome sequencing project, and this will surely be an easy problem to solve, but I'm stuck at the VQSR step. At the moment I'm working with just one sample, to test the pipeline I set up, but I understand it is not possible to run VQSR with so few samples. For this reason I'm trying to download VCF files from 1000 Genomes and add them to mine. I have a few questions though: first of all, the sequences were obtained with a different platform than mine (Illumina GA IIx) and I couldn't find any information about how the variants were called (I used HaplotypeCaller). Is it safe to run my sample together with these? And second, in the 1000 Genomes ftp folder there are several VCFs, more than one per chromosome and also one for autosomes. Should I merge them all together? Or would you suggest some other source where I can find a single, already-merged VCF (or BAMs, in case it's better to start from those)? Thanks in advance for any help you can give me!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    @caligiuri As explained elsewhere on this site, you should start from the bams, not the VCFs. Platform doesn't matter much. Better to choose the samples based on ethnicity (to match yours). Be sure to look up our new workflow where variants are called per-sample followed by joint genotyping.

    Geraldine Van der Auwera, PhD

  • caligiuricaligiuri Posts: 2Member
    edited August 5

    Thanks a lot for your fast reply, Geraldine! I'll do as the new workflow suggests... I just have one more question about it: should I run HaplotypeCaller on each .bam separately? Will it take a long time to do it one at a time? Again, thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,973Administrator, GATK Developer admin

    @caligiuri‌ Yes, you run HC on each bam separately (with -ERC GVCF and a few additional arguments; see docs). It takes time, sure, but less than trying to run them all together (because HC's runtime increases exponentially as you add samples to a multisample analysis). And the good thing is that if you have a cluster (or several machines available) you can run the samples in parallel.
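    A sketch of the per-sample loop (file names are placeholders; see the docs for the full set of recommended arguments):

    ```
    # Run HaplotypeCaller in GVCF mode on each sample; the '&' launches
    # the jobs in parallel, which suits a multi-core machine or can be
    # replaced by cluster job submission.
    for bam in sample1.bam sample2.bam sample3.bam; do
        java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
            -R reference.fasta -I "$bam" \
            -ERC GVCF \
            -o "${bam%.bam}.g.vcf" &
    done
    wait
    ```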

    Geraldine Van der Auwera, PhD
