(howto) Recalibrate variant quality scores = run VQSR

Geraldine_VdAuwera, Posts: 8,173, Administrator, GATK Dev
edited December 2014 in Tutorials

Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

Prerequisites

  • TBD

Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document. See that document also for caveats regarding exome vs. whole genome analysis designs.

Steps

  1. Prepare recalibration parameters for SNPs
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  2. Build the SNP recalibration model

  3. Apply the desired level of recalibration to the SNPs in the call set

  4. Prepare recalibration parameters for Indels
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  5. Build the Indel recalibration model

  6. Apply the desired level of recalibration to the Indels in the call set


1. Prepare recalibration parameters for SNPs

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • True sites training resource: HapMap

This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G

This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP

This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
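The Phred-scale percentages quoted above can be double-checked with a quick calculation. This is a generic sketch for any POSIX shell with awk, not a GATK command:

```shell
# Convert a Phred-scaled prior Q to the percent probability that a site is true:
# p = (1 - 10^(-Q/10)) * 100
phred_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%.2f", (1 - 10^(-q / 10)) * 100 }'
}

phred_to_prob 15; echo "  # HapMap prior  -> 96.84"
phred_to_prob 12; echo "  # Omni prior    -> 93.69"
phred_to_prob 10; echo "  # 1000G prior   -> 90.00"
phred_to_prob 2;  echo "  # default prior -> 36.90"
```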

b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command), it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see the caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • RMSMappingQuality (MQ)

Estimation of the overall mapping quality of reads supporting a variant call.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.
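Before passing annotations with -an, it can help to confirm that they are actually declared in your VCF header. This sketch uses only standard Unix tools; mock_header.vcf is a stand-in for your real callset:

```shell
# Extract the INFO annotation IDs declared in a VCF header.
# mock_header.vcf is illustrative; run the grep/sed pipeline on raw_variants.vcf instead.
cat > mock_header.vcf <<'EOF'
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant confidence by depth">
##INFO=<ID=FS,Number=1,Type=Float,Description="Strand bias estimated with Fisher's exact test">
EOF

grep '^##INFO' mock_header.vcf | sed 's/.*ID=\([^,]*\),.*/\1/'
```

If an annotation you plan to use is missing, add it with VariantAnnotator as noted above.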

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.
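The tranche idea can be made concrete with a toy example. The VQSLOD values below are made up, and this is only a conceptual sketch of how a truth-sensitivity target maps to a score cutoff, not how GATK computes its thresholds internally:

```shell
# Ten made-up VQSLOD scores for truth sites, sorted best to worst.
scores=$(printf '%s\n' 5.1 4.8 4.2 3.9 3.0 2.7 2.1 1.5 0.8 -1.2 | sort -g -r)

# cutoff_for S prints the VQSLOD cutoff that retains S percent of the truth sites.
cutoff_for() {
  n=$(echo "$scores" | wc -l)
  keep=$(awk -v n="$n" -v s="$1" 'BEGIN { printf "%d", int(n * s / 100 + 0.999) }')
  echo "$scores" | sed -n "${keep}p"
}

cutoff_for 90    # keeps the top 9 of 10 truth sites; prints 0.8
cutoff_for 100   # keeps all 10 truth sites; prints -1.2
```

A lower sensitivity target yields a stricter (higher) VQSLOD cutoff, which is exactly the specificity/sensitivity trade-off the tranches encode.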


2. Build the SNP recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \ 
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \ 
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \ 
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \ 
    -an DP \ 
    -an QD \ 
    -an FS \ 
    -an SOR \ 
    -an MQ \
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -an InbreedingCoeff \
    -mode SNP \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -rscriptFile recalibrate_SNP_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the VQSR method documentation and presentation videos.


3. Apply the desired level of recalibration to the SNPs in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -mode SNP \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -o recalibrated_snps_raw_indels.vcf 

Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and marked PASS or filtered (with the name of the relevant tranche in the FILTER field) depending on whether or not they fall within the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%), which provides the highest sensitivity you can get while still being acceptably specific.
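One way to see the effect of the chosen tranche is to tally the FILTER column of the output VCF. This is a generic sketch on a tiny mock file; the tranche filter name shown follows the pattern GATK typically writes, but check the header of your own output for the exact strings:

```shell
# Build a three-record mock VCF (tab-separated columns, as in a real VCF).
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\t.\tA\tG\t50.0\tPASS\t.\n'
  printf '1\t200\t.\tC\tT\t40.0\tVQSRTrancheSNP99.00to99.90\t.\n'
  printf '1\t300\t.\tG\tA\t60.0\tPASS\t.\n'
} > mock.vcf

# Count records per FILTER value; run the same pipeline on recalibrated_snps_raw_indels.vcf.
grep -v '^#' mock.vcf | cut -f 7 | sort | uniq -c | sort -rn
```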


4. Prepare recalibration parameters for Indels

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • Known and true sites training resource: Mills

This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). These sites are also flagged as known (known=true) for stratifying output metrics. The prior likelihood we assign to these variants is Q12 (93.69%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command), it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see the caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

d. Determine additional model parameters

  • Maximum number of Gaussians (-maxGaussians) 4

This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include a very large number of variants.


5. Build the Indel recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \ 
    -an QD \
    -an DP \ 
    -an FS \ 
    -an SOR \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -an InbreedingCoeff \
    -mode INDEL \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    --maxGaussians 4 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -rscriptFile recalibrate_INDEL_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.


6. Apply the desired level of recalibration to the Indels in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -mode INDEL \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -o recalibrated_variants.vcf 

Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the variants from the recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and marked PASS or filtered (with the name of the relevant tranche in the FILTER field) depending on whether or not they fall within the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills truth training set of Indels. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%), which provides the highest sensitivity you can get while still being acceptably specific.


Geraldine Van der Auwera, PhD

Comments

  • KinMok (London), Posts: 9, Member

    Hi @Geraldine_VdAuwera,

    I would like to ask for a reconfirmation that DP (coverage) should not be used in VQSR.
    This is explained in "discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project". However, a comment on this page from Jan 8 said "same parameters as shown in the above section".

    Thank you for your help.

    (pardon if I have sent this twice as I cannot see my post after the first upload)

    Kin

  • Geraldine_VdAuwera, Posts: 8,173, Administrator, GATK Dev

    Hi @KinMok‌,

    As explained in the document about VQSR training sets and arguments, DP should be used for whole genomes but not for exomes.

    Also, note the caveat at the start of this document:

    This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document.

    Geraldine Van der Auwera, PhD

  • yeinhorn (Israel), Posts: 5, Member

    Hi @Geraldine_VdAuwera,
    1. Can you explain why I need a large number of samples to use VQSR? Aren't the samples independent?
    2. Does it matter if the samples are dependent (same family, same rare disease) or independent (random samples from random sources)?
    3. Is there a minimum number of samples needed to use VQSR for exome analysis and for whole genome analysis?

    Thanks,
    Yaron

  • pdexheimer, Posts: 483, Member, GATK Dev, DSDE Dev

    @yeinhorn‌ - The key requirement is really number of variants, not number of samples. It's just that it's easier to add more samples than it is to increase variation...

    My understanding (which might be incomplete/wrong) is that the major underlying requirement is that you have a sufficient number of variants that overlap your set of "true" variants, so that the Gaussian model can be effectively constructed. You also need enough variants to have a meaningful "bad" set, which is also used in cluster discrimination.

    So yes, it does matter if your samples are from the same family, as they'll have fewer total variants between them. The same rare disease shouldn't impact anything, since they should only have a couple of variants in common. I don't think the sample requirements are hugely different between exomes and whole genomes; I try really hard not to use fewer than 30 samples. You'll see improved performance with more samples; I'm not sure where diminishing returns kick in.

  • Geraldine_VdAuwera, Posts: 8,173, Administrator, GATK Dev

    To add only a tiny bit to @pdexheimer 's excellent answer, you typically need somewhat fewer whole genomes compared to whole exomes because the former contain more variants than the latter. However, because the known datasets used as truth sets tend to mostly comprise variants that fall within exome regions, a lot of the extra variants in whole genomes aren't actually useful for VQSR because they don't overlap the truth set.

    It's hard to even say how many variants you need because it depends on how the annotation features are distributed among them. If you have nice clear clustering on all dimensions, you'll need fewer variants than if they are very spread out. And you typically don't know that until you've run the program...

    Geraldine Van der Auwera, PhD

  • Gustav (Sweden), Posts: 12, Member
    edited September 2014

    .

  • Geraldine_VdAuwera, Posts: 8,173, Administrator, GATK Dev

    Answered here: http://gatkforums.broadinstitute.org/discussion/comment/15888/#Comment_15888

    In future, please don't post the same question twice in different places. We see all the posts that go through and answer them regardless of where they are. When you post twice, you make our job harder and it adds noise to the forum.

    Geraldine Van der Auwera, PhD

  • SharonCox, Posts: 7, Member

    I used the VariantRecalibrator and ApplyRecalibration tools separately for Indels and SNPs. At this point, what tool should I use to have a single VCF? CombineVariants? With which options?

  • Sheila (Broad Institute), Posts: 1,689, Member, GATK Dev, Broadie, Moderator, DSDE Dev

    @SharonCox

    Hi,

    You can use CombineVariants with simply the required inputs. There is no need to use any optional parameters.

    However, in the future it may be easier to follow this tutorial: https://www.broadinstitute.org/gatk/guide/article?id=2805 which shows how to recalibrate SNPs and indels separately, but still end up with a single VCF.

    -Sheila
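    As a concrete sketch of the answer above (GATK 3 syntax; the input file names are placeholders for your two separately recalibrated VCFs):

    java -jar GenomeAnalysisTK.jar \
        -T CombineVariants \
        -R reference.fa \
        --variant recalibrated_snps.vcf \
        --variant recalibrated_indels.vcf \
        -o recalibrated_combined.vcf

    If GATK complains about merging genotypes from files containing the same samples, consult the CombineVariants documentation for the genotype merge options.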

  • Rebecca, Posts: 6, Member

    Hi,

    if I have non-model organisms (bacterial populations) with no SNP databases, is it possible to use VQSR, or does it make sense at all to use it? Or would it be more recommended to use hard filters to determine the accuracy of a variant?

  • tommycarstensen (United Kingdom), Posts: 336, Member ✭✭✭

    @Rebecca if you have a training set, then you can use VQSR. If not, then you can't.

  • Rebecca, Posts: 6, Member

    @tommycarstensen Do you mean by training set a dataset where I know what SNPs are reliable to build the calibration model? I guess then I cannot use it. Thanks for the help!

  • tommycarstensen (United Kingdom), Posts: 336, Member ✭✭✭

    @Rebecca Yup, you need a set of truth sites for any machine learning method. See the best practices for further information. Good luck. Have fun.

  • Rebecca, Posts: 6, Member
  • oussama_benhrif (maroc), Posts: 5, Member

    Hi,
    If I have one sample (paired end), can I use VQSR?
    Thanks

  • Sheila (Broad Institute), Posts: 1,689, Member, GATK Dev, Broadie, Moderator, DSDE Dev

    @oussama_benhrif
    Hi,

    If your sample is whole genome, then 1 sample is enough. However, for whole exomes we recommend using 30 or more samples for VQSR.

    -Sheila

  • oussama_benhrif (maroc), Posts: 5, Member

    Thank you Sheila,
    my sample is whole exome.

  • oussama_benhrif (maroc), Posts: 5, Member
    edited April 30

    Hi @Sheila
    Should I use a hard filter to replace VQSR?

  • Sheila (Broad Institute), Posts: 1,689, Member, GATK Dev, Broadie, Moderator, DSDE Dev
  • 5581681555816815 TNPosts: 12Member
    edited April 30

    I noticed that after I select SNP (SelectVariants = SNP) and INDEL (SelectVariants = INDEL) out to run VQSR separately, the total does not add up to the original (prior to selection).

    original:
    $ wc -l ceu_trio.noCombine.04142015.vcf
    6576494 ceu_trio.noCombine.04142015.vcf

    SNP:
    $ wc -l ceu_trio.noCombine.04142015.snp.vcf
    5438262 ceu_trio.noCombine.04142015.snp.vcf

    INDEL
    $ wc -l ceu_trio.noCombine.04142015.indel.vcf
    1125600 ceu_trio.noCombine.04142015.indel.vcf

    5438262 + 1125600 = 6563862, which is less than 6576494 (even though I am counting the headers twice, which is negligible compared to the number of variants).

    Some variants are NOT selected and will be lost. Should I recover them, and what would be the recommended strategy?

    The only reason I can think of is that the MNPs were lost because I did not specify MNP while selecting SNPs. I had the impression that MNPs will by default go into INDEL (or, whatever is not a SNP will be selected as "INDEL") -- does this still hold?

    Thanks,

    Shuoguo

  • Sheila (Broad Institute), Posts: 1,689, Member, GATK Dev, Broadie, Moderator, DSDE Dev

    @55816815
    Hi Shuoguo,

    You are missing the mixed records. The mixed records have both SNPs and Indels at the site. You can specify MIXED with -selectType https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#--selectTypeToInclude

    -Sheila
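    A sketch of the SelectVariants call described above (GATK 3 syntax; the file names are illustrative):

    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fa \
        -V raw_variants.vcf \
        -selectType MIXED \
        -o mixed_records.vcf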

  • 5581681555816815 TNPosts: 12Member
    edited April 30

    @Sheila

    Thanks! But these variants should not go through VQSR steps, right?

    Shuoguo

  • tommycarstensen (United Kingdom), Posts: 336, Member ✭✭✭

    @55816815 I'm a bit confused. The best practices do not require that you run SelectVariants. Instead you can just run VariantRecalibrator with -mode SNP and -mode INDEL on a file containing all of your called variants. See the best practices procedure above, which suggests running VR and AR in sequence twice. I think this is what most people do.

    To answer your latest question. Your MIXED variants should indeed go through VQSR. They will be evaluated, when you run VR with -mode INDEL.

  • Geraldine_VdAuwera, Posts: 8,173, Administrator, GATK Dev

    It seems the best practice workflow is not entirely clear to everyone, but as @tommycarstensen says you don't need to separate out the records by type. This is explained in detail in the latest workshop presentations, which you can find on the GATK blog (pending addition to the Presentations library).

    Geraldine Van der Auwera, PhD

  • 5581681555816815 TNPosts: 12Member

    @tommycarstensen

    Thanks! I think I am still doing VQSR the "older way"? I always select out SNPs and INDELs into separate VCFs and run VQSR separately, then merge them back once VQSR is done.

    I think the "newer" way of VQSR is easier. @Geraldine_VdAuwera I have gone through the newest workshop videos but mainly on variant call part to understand the new joint genotyping -- will finish all videos to appreciate all the good things you guys have done!

  • tommycarstensen (United Kingdom), Posts: 336, Member ✭✭✭

    @55816815 Personally I do it in a third way. I run two VR jobs for SNPs and indels simultaneously and then I apply the recalibration using either the unix utilities paste, sort -m and awk or Python and the function heapq.merge() on the .recal and .vcf.gz files. It's mostly because I'm impatient and want to avoid having to wait for two VR+AR sequential runs. Perhaps a future version of GATK will allow one to use multiple .recal files. That would be cool, except it would make my code redundant and make me look like an idiot for having written it :)

  • knightjimr (Yale University), Posts: 3, Member

    I admit I'm confused by this part of the filtering. I am using the best practice VQSR process (with two VR+AR sequential runs), but still worry that the MIXED variants are not evaluated. Are they evaluated and filtered when -mode INDEL is set?

    Is there a best practice for the MIXED variants with hard filtering? I just found out that when I run the two SelectVariants+FilterVariants (for SNP and for INDEL), those MIXED variants are lost in the "filtered" output, and I haven't found any text describing the best practice filtering for MIXED variants.

    I just would like to ensure that, for VQSR and hard filtering, all of the raw variants are being evaluated and filtered, without any being dropped on the floor because they are not simple SNP's or simple INDEL's.

  • pdexheimer, Posts: 483, Member, GATK Dev, DSDE Dev

    @knightjimr

    As you may have noticed, the variant types for SelectVariants (INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION) are different than the modes for ApplyRecalibration (SNP, INDEL, BOTH). The VR/AR modes are much less precise, encompassing several variant types. Specifically, the SNP mode in AR/VR includes both SNPs and MNPs, and the INDEL mode includes INDEL, MIXED, and SYMBOLIC variants.

  • aneek, Posts: 29, Member
    edited July 1

    Hi, while trying to use VariantRecalibrator tool, I got the following error message:

    "Bad input: Values for SQR annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations."

    Any suggestion about what this message means and how to solve this problem. I have used the same command lines as mentioned in this tutorial above.

    Thanks..

    P.S.- Sorry for this post. I have found my fault. I had given a wrong input as -an SQR. It should be -an SOR.

  • Sheila (Broad Institute), Posts: 1,689, Member, GATK Dev, Broadie, Moderator, DSDE Dev

    @aneek
    Hi,

    I suspect you simply mistyped SOR as SQR.

    -Sheila

  • aneek, Posts: 29, Member

    Hi, Sheila,

    Thank you for the correction. Another thing I would like to ask: using the -an InbreedingCoeff option is giving me an error like 'bad input.....'. From the FAQs (https://www.broadinstitute.org/gatk/guide/article?id=1259) I came to know that this option should be used only with at least 10 samples. Since I am analyzing a single sample, I think I should omit this option from the command line. Please correct me if I am wrong.

  • aneek, Posts: 29, Member

    Hi,

    I have an another query regarding VQSR function.

    For SNP recalibration, when I used the command mentioned in this tutorial it gave me an error message "No data found", with an error stack trace like:

    ERROR stack trace

    java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:408)
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:156)
    at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

    Then I found some information at this link: http://gatkforums.broadinstitute.org/discussion/3067/variantrecalibrator-unable-to-retrieve-result and added the "--maxGaussians 4" option to my command line since I am doing analysis for a single sample, and it gave me the result perfectly.

    After that, when I started INDEL recalibration using the commands mentioned in this tutorial, it gave me the same error message as above, even after adding the "--maxGaussians 4" option.

    Then I reduced the value of this option to 1 (i.e. --maxGaussians 1) and retried the same command line. It gave me the output smoothly, without any errors.
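    For reference, the only change was the --maxGaussians argument on the VariantRecalibrator step. A rough sketch of the INDEL command as I ran it (file names, resource files, and annotations below are placeholders for my own data, not exact values from this tutorial):

    ```shell
    # INDEL recalibration for a single sample; file names are placeholders.
    # --maxGaussians caps how many Gaussians the model may fit, which helps
    # when there are too few variants to support the default number.
    java -jar GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R reference.fa \
        -input single_sample.vcf \
        -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.vcf \
        -an QD -an FS -an SOR -an ReadPosRankSum \
        -mode INDEL \
        --maxGaussians 1 \
        -recalFile recalibrate_INDEL.recal \
        -tranchesFile recalibrate_INDEL.tranches
    ```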

    So, in short, does that mean that since I am using a single sample for the analysis, I should always use --maxGaussians 1 instead of --maxGaussians 4 in both SNP and INDEL recalibration?

    If so, then I suppose that as the number of samples increases, the value of the --maxGaussians parameter should also be increased (e.g. --maxGaussians 2 for 2 samples, --maxGaussians 3 for 3 samples, etc.).

    Another thing: should I also use the "--validation_strictness LENIENT" parameter for variant files with few samples?

    Please suggest.

    Thanks.

  • SheilaSheila Broad InstitutePosts: 1,689Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @aneek
    Hi,

    Yes, the inbreeding coefficient should not be used on a single sample.
    Are your samples whole exome or whole genome? We recommend using at least 30 whole exome samples in VQSR to get good results. However, if you are using a whole genome sample, 1 sample is enough.

    -Sheila

  • aneekaneek Posts: 29Member

    Hi Sheila,
    I am currently using a single whole exome sample. We receive exome sequencing data for one sample at a time for genetic testing purposes, so we have to run the GATK analysis on a single exome. It is not possible for us to wait until 30 exome samples have accumulated; otherwise we would be late delivering reports to the patients. Please advise whether the VQSR analysis parameters I mentioned in my earlier query are OK for a single-exome analysis, and whether my understanding of the "--maxGaussians" function is correct.

    If not, then kindly recommend the parameters I should use for this analysis to get at least a fair result for a single whole exome sample.

    Thanks..

  • SheilaSheila Broad InstitutePosts: 1,689Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @aneek
    Hi,

    We do not recommend using VQSR on less than 30 whole exome samples.

    Do you have samples from previous tests that you can use? The GVCF workflow is designed for cases like yours: you simply create a GVCF for each sample as you get it, then joint-genotype all the accumulated GVCFs with GenotypeGVCFs each time you add a new sample. https://www.broadinstitute.org/gatk/guide/article?id=4150
    https://www.broadinstitute.org/gatk/guide/article?id=3893
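    In outline, the workflow looks like this (file names are placeholders; see the linked articles for the full set of arguments):

    ```shell
    # Step 1: run HaplotypeCaller in GVCF mode on each sample as it arrives.
    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R reference.fa \
        -I sampleN.bam \
        --emitRefConfidence GVCF \
        -o sampleN.g.vcf

    # Step 2: joint-genotype across all GVCFs accumulated so far.
    java -jar GenomeAnalysisTK.jar \
        -T GenotypeGVCFs \
        -R reference.fa \
        -V sample1.g.vcf \
        -V sample2.g.vcf \
        -V sampleN.g.vcf \
        -o joint_calls.vcf
    ```

    Only step 2 needs to be rerun when a new sample is added; the per-sample GVCFs from step 1 are reused as-is.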

    You can also use data from the 1000Genomes project to add more data to your VQSR input. You can get bam files from samples related to yours and run the pipeline on them. Please see this thread for more information: http://gatkforums.broadinstitute.org/discussion/3612/1000genomes-exomes-and-vqsr

    Lastly, if you do not have samples from previous tests or you do not want to use 1000Genomes data, you can use hard filtering for the single exome. http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set
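    As a rough sketch of the hard filtering route (SNP side only; the thresholds are the generic recommendations from the linked howto, and file names are placeholders):

    ```shell
    # Step 1: pull the SNPs out of the raw callset.
    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fa \
        -V raw_calls.vcf \
        -selectType SNP \
        -o raw_snps.vcf

    # Step 2: mark SNPs failing the generic hard-filter thresholds.
    # Failing records are kept in the VCF with FILTER set to "snp_hard_filter".
    java -jar GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R reference.fa \
        -V raw_snps.vcf \
        --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
        --filterName "snp_hard_filter" \
        -o filtered_snps.vcf
    ```

    The INDEL side is analogous, with -selectType INDEL and the INDEL-specific thresholds from the howto.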

    -Sheila

  • aneekaneek Posts: 29Member

    Hi Sheila,

    Thank you very much for the suggestions. If I use the hard filtering option you mentioned here, it generates the filtered SNP and INDEL files separately. After this, if I want to combine both of these files into a single VCF file, which command should I use?

    Also, can you elaborate a little on the usage of the "--maxGaussians" parameter? What does it do, and why should we use --maxGaussians 4 or more rather than --maxGaussians 1, 2, or 3 (although --maxGaussians 1, 2, or 3 works in the command line)?

    Thanks..

  • SheilaSheila Broad InstitutePosts: 1,689Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @aneek
    Hi,

    You can use CombineVariants to merge the two filtered VCFs.
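    Something along these lines (file names are placeholders):

    ```shell
    # Merge the separately filtered SNP and INDEL VCFs back into one callset.
    # UNSORTED simply interleaves the genotypes since the inputs come from
    # the same sample(s) and have no overlapping records.
    java -jar GenomeAnalysisTK.jar \
        -T CombineVariants \
        -R reference.fa \
        --variant filtered_snps.vcf \
        --variant filtered_indels.vcf \
        --genotypemergeoption UNSORTED \
        -o filtered_combined.vcf
    ```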

    Have a look at this article under "Can I use the variant quality score recalibrator with my small sequencing experiment?" for more information on maxGaussians. https://www.broadinstitute.org/gatk/guide/article?id=39

    I hope this helps.

    -Sheila

  • aneekaneek Posts: 29Member

    Hi Sheila,
    Thanks a lot for the information..

  • knightjimrknightjimr Yale UniversityPosts: 3Member

    @pdexheimer Thanks for the response. I had seen the documentation on that, but that difference I think adds to the confusion.

    What I really want to know, in plain English, is: does each variant called by GenotypeGVCFs get assigned a PASS or non-PASS value by the filtering process, ideally in such a way that I can create a single VCF file containing those calls with their assigned value?

    For the best practices hard filtering process, the answer is no. I have single-sample exome callsets where some raw VCF lines have a "G A,AT ... 1/2:..." call (with 1/2 as the genotype). Those lines are not included by either the SNP or the INDEL filtering in the best practices, so in effect they disappear. In my test, if I also run SelectVariants with type MIXED on my dataset, I get the remaining variants. But there are no best practices for filtering those, and I don't know for sure that this accounts for all variants in all callsets (it worked on my dataset, but I can't tell whether that covers every case). (And if you would like me to move this question to a hard-filtering-specific forum page, I'd be happy to do that.)

    For VQSR, my question is, does the resulting VCF file, after going through the best practices process, have every variant call that was produced by GenotypeGVCFs?

  • pdexheimerpdexheimer Posts: 483Member, GATK Dev, DSDE Dev mod

    @knightjimr - Yes. The VQSR INDEL mode explicitly includes MIXED variants.

  • knightjimrknightjimr Yale UniversityPosts: 3Member

    Thanks. Given my surprise with hard filtering, I just wanted to double check.
