The current GATK version is 3.6-0


(howto) Recalibrate variant quality scores = run VQSR

edited December 2014

Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

• TBD

Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document. See that document also for caveats regarding exome vs. whole genomes analysis design.

Steps

1. Prepare recalibration parameters for SNPs
a. Specify which call sets the program should use as resources to build the recalibration model
b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

2. Build the SNP recalibration model

3. Apply the desired level of recalibration to the SNPs in the call set

4. Prepare recalibration parameters for Indels
a. Specify which call sets the program should use as resources to build the recalibration model
b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

5. Build the Indel recalibration model

6. Apply the desired level of recalibration to the Indels in the call set

1. Prepare recalibration parameters for SNPs

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

• True sites training resource: HapMap

This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

• True sites training resource: Omni

This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

• Non-true sites training resource: 1000G

This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

• Known sites resource, not used in training: dbSNP

This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as the Ti/Tv ratio by whether variants are present in dbSNP or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
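
The Phred-scaled priors above map to probabilities via p = 1 - 10^(-Q/10). As a quick sketch (plain Python, just to reproduce the percentages quoted for each resource):

```python
def phred_prior_to_prob(q):
    """Convert a Phred-scaled prior Q to the probability that a site is true:
    p = 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)

# The priors assigned to the training resources above:
for q in (15, 12, 10, 2):
    print(f"Q{q} -> {phred_prior_to_prob(q) * 100:.2f}%")
```

This prints Q15 -> 96.84%, Q12 -> 93.69%, Q10 -> 90.00%, and Q2 -> 36.90%, matching the values given in the resource descriptions.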

b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

• Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see the caveat detailed in the VQSR arguments FAQ doc.

• QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

• FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

• StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

• MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

• ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

• RMSMappingQuality (MQ)

Estimation of the overall mapping quality of reads supporting a variant call.

• InbreedingCoeff

Evidence of inbreeding in a population. See the caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.
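
To make two of these annotations concrete, here is a small illustrative sketch in Python. It is not GATK's implementation (GATK's FisherStrand includes extra handling for large tables); it just shows the core calculations: QD as variant confidence over depth, and FS as the Phred-scaled p-value of a two-sided Fisher's exact test on the 2x2 strand table.

```python
import math

def qual_by_depth(qual, depth_nonref):
    """QD sketch: variant confidence (QUAL) / unfiltered depth of non-ref samples."""
    return qual / depth_nonref

def _hypergeom_p(x, row1, row2, col1):
    # Probability of a 2x2 table with fixed margins and top-left cell x.
    n = row1 + row2
    return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

def fisher_strand(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """FS sketch: Phred-scaled two-sided Fisher's exact p-value on the strand
    table [[ref_fwd, ref_rev], [alt_fwd, alt_rev]]. Higher FS = more bias."""
    row1, row2, col1 = ref_fwd + ref_rev, alt_fwd + alt_rev, ref_fwd + alt_fwd
    p_obs = _hypergeom_p(ref_fwd, row1, row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Two-sided p-value: sum over all tables at most as probable as the observed one.
    p = sum(_hypergeom_p(x, row1, row2, col1)
            for x in range(lo, hi + 1)
            if _hypergeom_p(x, row1, row2, col1) <= p_obs * (1 + 1e-9))
    return -10.0 * math.log10(max(min(p, 1.0), 1e-320))
```

A balanced strand table (e.g. 10 forward / 10 reverse for both alleles) yields FS near 0, while an alternate allele seen on only one strand yields a large FS, flagging likely artifacts.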

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

• First tranche threshold 100.0

• Second tranche threshold 99.9

• Third tranche threshold 99.0

• Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.
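
The tranche thresholds can be thought of as follows: rank the truth-set variants by VQSLOD, then for each target sensitivity find the lowest VQSLOD you must accept to retain that fraction of truth sites. A simplified sketch (the function below is hypothetical; the real tool computes these cutoffs while building the model):

```python
import math

def tranche_cutoffs(truth_vqslods, thresholds):
    """For each truth-sensitivity threshold (percent), return the VQSLOD cutoff
    that retains at least that fraction of truth-set variants."""
    ranked = sorted(truth_vqslods, reverse=True)  # best-scoring sites first
    n = len(ranked)
    cutoffs = {}
    for t in thresholds:
        k = max(1, math.ceil(n * t / 100.0))  # how many truth sites to keep
        cutoffs[t] = ranked[k - 1]            # lowest VQSLOD still included
    return cutoffs
```

With 100 truth sites scored 1 through 100, a 90.0 tranche keeps the top 90 sites (cutoff 11), while the 100.0 tranche must keep them all (cutoff 1), illustrating why higher-sensitivity tranches admit lower-scoring, riskier calls.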

2. Build the SNP recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R reference.fa \
-input raw_variants.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
-an DP \
-an QD \
-an FS \
-an SOR \
-an MQ \
-an MQRankSum \
-an InbreedingCoeff \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
-recalFile recalibrate_SNP.recal \
-tranchesFile recalibrate_SNP.tranches \
-rscriptFile recalibrate_SNP_plots.R


Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the VQSR method documentation and presentation videos.

3. Apply the desired level of recalibration to the SNPs in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R reference.fa \
-input raw_variants.vcf \
-mode SNP \
--ts_filter_level 99.0 \
-recalFile recalibrate_SNP.recal \
-tranchesFile recalibrate_SNP.tranches \
-o recalibrated_snps_raw_indels.vcf


Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the original variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.
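
Conceptually, ApplyRecalibration marks each SNP by comparing its VQSLOD to the cutoff implied by the chosen --ts_filter_level. A minimal sketch of that PASS/filter decision (the function and the filter label shown are illustrative, not GATK's internals):

```python
def apply_tranche_filter(records, vqslod_cutoff, tranche_label):
    """records: iterable of (variant_id, vqslod) pairs.
    Variants scoring at or above the cutoff are marked PASS; the rest get the
    tranche filter label written into the VCF FILTER column."""
    return [(vid, "PASS" if v >= vqslod_cutoff else tranche_label)
            for vid, v in records]

# Hypothetical example: a cutoff of 0.0 derived from the tranches file.
result = apply_tranche_filter([("rs1", 5.2), ("rs2", -1.3)],
                              0.0, "VQSRTrancheSNP99.00to99.90")
```

Here "rs1" passes and "rs2" is labeled with the tranche filter; note that filtered variants are annotated, not removed, so no records are lost from the VCF.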

4. Prepare recalibration parameters for Indels

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

• Known and true sites training resource: Mills

This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

• Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see the caveat detailed in the VQSR arguments FAQ doc.

• QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

• FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

• StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

• MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

• ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

• InbreedingCoeff

Evidence of inbreeding in a population. See the caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

• First tranche threshold 100.0

• Second tranche threshold 99.9

• Third tranche threshold 99.0

• Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

• Maximum number of Gaussians (-maxGaussians) 4

This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include a very large number of variants.
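
To give a feel for what -maxGaussians bounds, here is a toy one-dimensional EM fit of a Gaussian mixture in plain Python. It is only an illustration: VariantRecalibrator actually fits a multivariate variational Bayes mixture over all the selected annotations at once. The intuition carries over, though: each Gaussian needs enough variants to estimate its parameters, which is why small callsets need a lower cap.

```python
import math

def gmm_em_1d(xs, max_gaussians, iters=100):
    """Toy EM fit of up to `max_gaussians` 1-D Gaussians to the data xs."""
    k = max_gaussians
    xs = sorted(xs)
    n = len(xs)
    # Initialise means at evenly spaced quantiles, with one broad shared sigma.
    mus = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    sigmas = [max((xs[-1] - xs[0]) / k, 1e-3)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E step: responsibility of each component for each point.
        resp = []
        for x in xs:
            ps = [w * math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
                  for w, m, s in zip(weights, mus, sigmas)]
            tot = sum(ps) or 1e-300
            resp.append([p / tot for p in ps])
        # M step: re-estimate weights, means, and variances per component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-6))
    return weights, mus, sigmas
```

Fitting two well-separated clusters recovers their means; asking for many more Gaussians than the data can support leaves components fighting over the same few points, which mirrors the "No data found" failures that come up in the comments when -maxGaussians is too high for a small callset.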

5. Build the Indel recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R reference.fa \
-input recalibrated_snps_raw_indels.vcf \
-resource:mills,known=true,training=true,truth=true,prior=12.0 mills.vcf \
-an QD \
-an DP \
-an FS \
-an SOR \
-an MQRankSum \
-an InbreedingCoeff \
-mode INDEL \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
--maxGaussians 4 \
-recalFile recalibrate_INDEL.recal \
-tranchesFile recalibrate_INDEL.tranches \
-rscriptFile recalibrate_INDEL_plots.R


Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.

6. Apply the desired level of recalibration to the Indels in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R reference.fa \
-input recalibrated_snps_raw_indels.vcf \
-mode INDEL \
--ts_filter_level 99.0 \
-recalFile recalibrate_INDEL.recal \
-tranchesFile recalibrate_INDEL.tranches \
-o recalibrated_variants.vcf


Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the original variants from the original recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training set of Mills indels. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.


Geraldine Van der Auwera, PhD


I would like to ask for reconfirmation that DP (coverage) should not be used in VQSR.
This is explained in "discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project". However, a comment on this page from Jan 8 said "same parameters as shown in the above section".

(pardon if I have sent this twice as I cannot see my post after the first upload)

Kin

Hi @KinMok‌,

As explained in the document about VQSR training sets and arguments, DP should be used for whole genomes but not for exomes.

Also, note the caveat at the start of this document:

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document.

Geraldine Van der Auwera, PhD


Hi @Geraldine_VdAuwera,
1. Can you explain why I need a large number of samples to use VQSR? Aren't the samples independent?
2. Does it matter if the samples are dependent (same family, same rare disease) or independent (random samples from random sources)?
3. Is there a minimum number of samples needed to use VQSR, for exome analysis and for whole genome analysis?

Thanks,
Yaron


@yeinhorn‌ - The key requirement is really number of variants, not number of samples. It's just that it's easier to add more samples than it is to increase variation...

My understanding (which might be incomplete/wrong) is that the major underlying requirement is that you have a sufficient number of variants that overlap your set of "true" variants, so that the Gaussian model can be effectively constructed. You also need enough variants to have a meaningful "bad" set, which is also used in cluster discrimination.

So yes, it does matter if your samples are from the same family, as they'll have fewer total variants between them. The same rare disease shouldn't impact anything, since they should only have a couple of variants in common. I don't think the sample requirements are hugely different between exomes and whole genomes; I try really hard not to use fewer than 30 samples. You'll see improved performance with more samples; I'm not sure where diminishing returns kick in.

To add only a tiny bit to @pdexheimer's excellent answer, you typically need somewhat fewer whole genomes compared to whole exomes because the former contain more variants than the latter. However, because the known datasets used as truth sets tend to mostly comprise variants that fall within exome regions, a lot of the extra variants in whole genomes aren't actually useful for VQSR because they don't overlap the truth set.

It's hard to even say how many variants you need because it depends on how the annotation features are distributed among them. If you have nice clear clustering on all dimensions, you'll need fewer variants than if they are very spread out. And you typically don't know that until you've run the program...

Geraldine Van der Auwera, PhD


In future, please don't post the same question twice in different places. We see all the posts that go through and answer them regardless of where they are. When you post twice, you make our job harder and it adds noise to the forum.

Geraldine Van der Auwera, PhD


I used the VariantRecalibrator and ApplyRecalibration tools separately for indels and SNPs. At this point, what tool should I use to get a single VCF: CombineVariants? With which options?

@SharonCox‌

Hi,

You can use CombineVariants with simply the required inputs. There is no need to use any optional parameters.

However, in the future it may be easier to follow this tutorial: https://www.broadinstitute.org/gatk/guide/article?id=2805 which shows how to recalibrate SNPs and indels separately, but still end up with a single VCF.

-Sheila


Hi,

if I have non-model organisms (bacterial populations) with no SNP databases, is it possible to use VQSR, and does it make sense at all to use it? Or would it be more advisable to use hard filters to determine the accuracy of a variant?


@Rebecca if you have a training set, then you can use VQSR. If not, then you can't.


@tommycarstensen Do you mean by training set a dataset where I know what SNPs are reliable to build the calibration model? I guess then I cannot use it. Thanks for the help!


@Rebecca Yup, you need a set of truth sites for any machine learning method. See the best practices for further information. Good luck. Have fun.


Hi,
If I have one sample (paired-end), can I use VQSR?
Thanks

If your sample is whole genome, then 1 sample is enough. However, for whole exomes we recommend using 30 or more samples for VQSR.

-Sheila


Thank you Sheila,
my sample is whole exome.


Hi @Sheila
Should I use a hard filter instead of VQSR?


I noticed that after I select SNPs (SelectVariants = SNP) and indels (SelectVariants = INDEL) out to run VQSR separately, the totals do not add up to the original (prior to selection).

original:
$ wc -l ceu_trio.noCombine.04142015.vcf
6576494 ceu_trio.noCombine.04142015.vcf

SNP:
$ wc -l ceu_trio.noCombine.04142015.snp.vcf
5438262 ceu_trio.noCombine.04142015.snp.vcf

INDEL:
$ wc -l ceu_trio.noCombine.04142015.indel.vcf
1125600 ceu_trio.noCombine.04142015.indel.vcf

5438262 + 1125600 = 6563862, which is less than 6576494 (even though I am counting the headers twice, which is negligible compared to the number of variants). Some variants are not selected and will be lost. Should I recover them, and what would be the recommended strategy? The only reason I can think of is that MNPs were lost because I did not specify MNP while selecting SNPs. I had the impression that MNPs would by default go into INDEL (or, whatever is not a SNP would be selected as "INDEL") -- does this still hold?

Thanks,
Shuoguo

@55816815 Hi Shuoguo,

You are missing the mixed records. The mixed records have both SNPs and indels at the site. You can specify MIXED with -selectType: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#--selectTypeToInclude

-Sheila

@Sheila Thanks! But these variants should not go through VQSR steps, right?

Shuoguo

@55816815 I'm a bit confused. The best practices do not require that you run SelectVariants. Instead you can just run VariantRecalibrator with -mode SNP and -mode INDEL on a file containing all of your called variants. See the best practices procedure above, which suggests running VR and AR in sequence twice. I think this is what most people do.

To answer your latest question: your MIXED variants should indeed go through VQSR. They will be evaluated when you run VR with -mode INDEL.

It seems the best practice workflow is not entirely clear to everyone, but as @tommycarstensen says, you don't need to separate out the records by type. This is explained in detail in the latest workshop presentations, which you can find on the GATK blog (pending addition to the Presentations library).

Geraldine Van der Auwera, PhD

@tommycarstensen Thanks! I think I am still doing VQSR the "older way": I always select out SNPs and indels into separate VCFs, run VQSR separately, then merge them back once VQSR is done. I think the "newer" way of running VQSR is easier. @Geraldine_VdAuwera I have gone through the newest workshop videos, but mainly the variant calling part, to understand the new joint genotyping -- I will finish all the videos to appreciate all the good things you guys have done!

@55816815 Personally I do it in a third way. I run two VR jobs for SNPs and indels simultaneously, and then I apply the recalibration using either the Unix utilities paste, sort -m and awk, or Python and the function heapq.merge() on the .recal and .vcf.gz files. It's mostly because I'm impatient and want to avoid having to wait for two sequential VR+AR runs. Perhaps a future version of GATK will allow one to use multiple .recal files. That would be cool, except it would make my code redundant and make me look like an idiot for having written it.

I admit I'm confused by this part of the filtering. I am using the best practice VQSR process (with two sequential VR+AR runs), but still worry that the MIXED variants are not evaluated. Are they evaluated and filtered when -mode INDEL is set? Is there a best practice for the MIXED variants with hard filtering? I just found out that when I run the two SelectVariants+FilterVariants steps (for SNP and for INDEL), those MIXED variants are lost in the "filtered" output, and I haven't found any text describing the best practice filtering for MIXED variants. I just would like to ensure that, for VQSR and hard filtering, all of the raw variants are being evaluated and filtered, without any being dropped on the floor because they are not simple SNPs or simple indels.

@knightjimr As you may have noticed, the variant types for SelectVariants (INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION) are different from the modes for ApplyRecalibration (SNP, INDEL, BOTH). The VR/AR modes are much less precise, encompassing several variant types. Specifically, the SNP mode in AR/VR includes both SNPs and MNPs, and the INDEL mode includes INDEL, MIXED, and SYMBOLIC variants.

Hi, while trying to use the VariantRecalibrator tool, I got the following error message: "Bad input: Values for SQR annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations." Any suggestion about what this message means and how to solve this problem? I have used the same command lines as mentioned in this tutorial above. Thanks.

P.S. Sorry for this post. I have found my fault: I had given a wrong input as -an SQR. It should be -an SOR.

@aneek
Hi,

I suspect you simply mistyped SOR as SQR.

-Sheila

Hi Sheila,

Thank you for the correction. Another thing I would like to ask: using the -an InbreedingCoeff option is giving me an error like 'bad input.....'. From the FAQs (https://www.broadinstitute.org/gatk/guide/article?id=1259) I came to know that this option should only be used with at least 10 samples. Since I am analyzing a single sample, I think I should omit this option from the command line. Please correct me if I am wrong.

Hi,

I have another query regarding VQSR.

For SNP recalibration, when I used the command mentioned in this tutorial, it gave me an error message "No data found" with the following stack trace:

ERROR stack trace
java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:408)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:156)
at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)

Then I found some information in this link: http://gatkforums.broadinstitute.org/discussion/3067/variantrecalibrator-unable-to-retrieve-result and added the "--maxGaussians 4" option to my command line, since I am doing the analysis for a single sample, and it gave me the result perfectly.

After that, when I started indel recalibration using the commands mentioned above in this tutorial, it gave me the same error messages as above, even after adding the "--maxGaussians 4" option.

Then I reduced the value of this option to 1 (i.e. --maxGaussians 1) and retried the same command line. It gave me the output smoothly, without any errors.

So, in short, does that mean that since I am using a single sample for the analysis, I should always use --maxGaussians 1 instead of --maxGaussians 4 in both SNP and indel recalibration?

If so, then I suppose that as the number of samples increases, the value of the --maxGaussians parameter should also be increased (e.g. --maxGaussians 2 for 2 samples, --maxGaussians 3 for 3 samples, etc.)?

Also, should I use the "--validation_strictness LENIENT" parameter for variant files from fewer samples?

Thanks.

@aneek
Hi,

Yes, inbreeding coefficient should not be used on a single sample.
Are your samples whole exome or whole genome? We recommend using at least 30 whole exome samples in VQSR to get good results. However, if you are using a whole genome sample, 1 sample is enough.

-Sheila


Hi Sheila,
I am currently using a single whole exome sample. Actually, we receive exome sequencing data for one sample at a time for genetic testing purposes, and hence we have to do the GATK analysis on a single exome. It is not possible for us to wait until 30 exome samples have accumulated; otherwise we would be late in giving reports to the patients. Please suggest whether the VQSR analysis parameters I mentioned in my earlier query are OK for a single exome analysis, and whether my understanding of the "--maxGaussians" function is correct.

If not, then kindly recommend the parameters I should use for this analysis to get at least a fair result for a single whole exome sample.

Thanks..

@aneek
Hi,

We do not recommend using VQSR on less than 30 whole exome samples.

Do you have samples from previous tests that you can use? The GVCF workflow is designed for cases like yours. You simply create a GVCF for each sample as you get it, then combine the GVCFs with GenotypeGVCFs each time you add a new sample. https://www.broadinstitute.org/gatk/guide/article?id=4150

You can also use data from the 1000Genomes project to add more data to your VQSR input. You can get bam files from samples related to yours and run the pipeline on them. Please see this thread for more information: http://gatkforums.broadinstitute.org/discussion/3612/1000genomes-exomes-and-vqsr

Lastly, if you do not have samples from previous tests or you do not want to use 1000Genomes data, you can use hard filtering for the single exome. http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set

-Sheila

• Posts: 73 · Member

Hi Sheila,

Thank you very much for the suggestions. If I use the hard filtering option you mentioned here, it generates the filtered SNP and INDEL files separately. After this, if I want to combine both of these files into a single vcf file, which command should I use?

Also, can you elaborate a little on the usage of the "--maxGaussians" argument? What does it do, and why should we use only --maxGaussians 4 or more and not --maxGaussians 1, 2 or 3 (although --maxGaussians 1, 2 or 3 works fine on the command line)?

Thanks..

@aneek
Hi,

You can use CombineVariants to merge the two filtered vcfs.
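For example (a sketch with placeholder file names, assuming GATK 3.x; the --assumeIdenticalSamples flag should apply here since both vcfs come from the same sample):

```shell
# Merge the separately filtered SNP and INDEL vcfs back into one callset;
# file names are placeholders
java -jar GenomeAnalysisTK.jar -T CombineVariants \
    -R reference.fasta \
    --variant filtered_snps.vcf \
    --variant filtered_indels.vcf \
    --assumeIdenticalSamples \
    -o filtered_combined.vcf
```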

I hope this helps.

-Sheila

• Posts: 73 · Member

Hi Sheila,
Thanks a lot for the information..

• Yale University · Posts: 3 · Member

@pdexheimer Thanks for the response. I had seen the documentation on that, but I think that distinction adds to the confusion.

What I really want to know, in English, is "Does each variant called by GenotypeGVCFs get assigned a PASS or non-PASS value by the filtering process, and ideally in such a way so that I can create a single VCF file containing those calls with their assigned value".

For the best practices hard filtering process, the answer is no. I have single sample exome callsets where some raw VCF lines have a "G A,AT ... 1/2:..." call (with 1/2 as the genotype). Those lines don't get included by the SNP or INDEL filtering in the best practices, so in effect they disappear. In my test, if I also do a SelectVariants on MIXED for my dataset, I get the remaining variants. But there are no best practices for filtering those, and I also don't know for sure that this will recover all variants for all callsets (it worked in my dataset, but I can't tell whether it covers everything). (And if you would like me to move this question to a hard-filtering-specific forum page, I'd be happy to do that.)

For VQSR, my question is, does the resulting VCF file, after going through the best practices process, have every variant call that was produced by GenotypeGVCFs?

• Posts: 542 · Member, Dev ✭✭✭✭

@knightjimr - Yes. The VQSR INDEL mode explicitly includes MIXED variants.

• Yale University · Posts: 3 · Member

Thanks. Given my surprise with hard filtering, I just wanted to double check.

• Greensboro · Posts: 58 · Member

I have finished recalibrating my bam files (whole genome, 6 samples). The all-sites vcf calling for 1 sample is taking more than 3 days. Since my model organism does not have truth, training, or known-sites variant resources, I wanted to ask whether it is best to 1) hard-filter the SNPs and indels, or 2) hard-filter all samples and collect the common variants, which can then be used as the truth/training resource for VQSR.
If anyone has come across the same or a similar problem, please let me know. How much benefit can I gain from the extra VQSR step? I ask because at some point I have to call "time out" on my data analysis.

• Greensboro · Posts: 58 · Member

Hi @Geraldine_VdAuwera: An additional question came up. I want to call variants on the recalibrated bam files and apply the strict filter. I can call variants for different samples from within the same population using:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R lyrata_genome.fa -I recalibSAMPLE01.bam -I recalibSAMPLE2.bam -I recalibSAMPLE3.bam \
    --genotyping_mode DISCOVERY -stand_emit_conf 30 -stand_call_conf 50 \
    -o confHCVariants.vcf
(as I have more than 1 sample, actually 6 samples per population).

If I instead call the variants separately on each recalibrated bam file and then merge the vcf files (from all samples) using:
java -jar GenomeAnalysisTK.jar \
-T CombineVariants \
-R reference.fasta \
--variant recalibSAMPLE1.vcf \
--variant recalibSAMPLE2.vcf \
--variant recalibSAMPLE3.vcf \
-o merged.vcf \
-genotypeMergeOptions UNIQUIFY

Would the confHCVariants.vcf and merged.vcf files be equivalent? If there is any difference, what would it be? I could run both analyses and compare them myself, but variant calling for just 1 sample is taking so long that I have to go ahead and ask. If they are the same, I plan to go with the former option.

Thanks,

@everestial007 No, it's not equivalent. The correct way to do it is explained in the Best Practices documentation.

-Sheila

To clarify @Sheila's comment, you need to apply a joint discovery method, which can be done in two ways:

1) provide all inputs together to HC, as in your first example;
2) run HC on each sample alone in -ERC GVCF mode to produce a .g.vcf file for each, then run GenotypeGVCFs on all .g.vcf files together.

These two ways are equivalent; the first is how people used to do it and scales badly with sample numbers, while the second is the new way, which scales much better.

Geraldine Van der Auwera, PhD

• Greensboro · Posts: 58 · Member

I have prepared a set of highly confident variants by unifying the variants from several samples and then selecting the common variants (shared by at least 3 samples) using hard filters. I now plan to use these variants for VQSR. However, I am encountering an issue that is not discussed anywhere in the forum.

I used the following command:
java -Xmx4g -jar /home/everestial007/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R lyrata_genome.fa \
    -input genotypedGvcfVARiantsMA.vcf \
    -resource:strictSelection,known=false,training=true,truth=false,prior=10.0 VARiantsMAYODANsnp.vcf \
    -an DP -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile raw.SNPs.recal \
    -tranchesFile raw.SNPs.tranches \
    -rscriptFile recalSNPs.plots.R

where genotypedGvcfVARiantsMA.vcf is the vcf called from several *.g.vcf samples using GenotypeGVCFs, and
VARiantsMAYODANsnp.vcf is the highly confident variant set for a new population of the model organism I am working with; it was prepared by unifying the variants from several samples and then selecting the common variants (shared by at least 3 samples) using hard filters.

I have followed the directions provided on the VQSR page to include the appropriate parameters/flags, but I am getting an error message. I am posting the part of the message that is most concerning:

ERROR MESSAGE: Invalid command line: No truth set found! Please provide sets of known polymorphic loci marked with the truth=true ROD binding tag. For example, -resource:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmapFile.vcf

Isn't the truth set the one I provided, i.e. VARiantsMAYODANsnp.vcf?

It looks like I need to adjust the vcf's ROD binding tag to include the information shown under "For example".

You need at least one set that has the property truth=true.
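Concretely, in the VariantRecalibrator command above that means flipping truth=false to truth=true in the resource tag, e.g.:

```shell
# Same resource line as before, now marked as a truth set
-resource:strictSelection,known=false,training=true,truth=true,prior=10.0 VARiantsMAYODANsnp.vcf
```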

Geraldine Van der Auwera, PhD

• Greensboro · Posts: 58 · Member

@Geraldine_VdAuwera said:
You need at least one set that has the property truth=true.

Great, it worked. It's annoying and surprising at the same time that such a tiny change can make everything work. Thanks a lot @Geraldine_VdAuwera for all the help you have given. Wishing you "Happy Holidays"!!!

Ah, but that tiny change is what determines how the data is used by the statistical model... That's pretty important.

Happy holidays to you too!

Geraldine Van der Auwera, PhD

• Israel · Posts: 6 · Member
edited January 6

Hi @Geraldine_VdAuwera,
I usually get a single exome each time, from a different lab, which uses a different sequencing machine and different coverage.
I want to use VQSR in order to get good filtration, but since 1 exome isn't enough to build a reliable model, there's an option to pad it with other exomes (as suggested, for example using 1kg or ExAC data, or my own other exomes).
But if they use different sequencers/coverage/labs, wouldn't that dramatically change the results of the model? For example, maybe for all the 1kg samples a FisherStrand score less than 6 indicates a false positive, while for my exome the cutoff is less than 12.
Is it still suggested to run the exome with other types of data to have many variants, or should I use hard filters in this case? Is there preferable data I should use (my data/1kg/ExAC)?
Or, as a third option, maybe I should still try to use VQSR for one exome, perhaps with a smaller number of Gaussians?
Thanks!

Post edited by yeinhorn

Having so much technical heterogeneity is definitely going to complicate things by muddling the tool's ability to build an appropriate model. It's hard to say if it will be so bad as to make the results worse than hard-filtering. What you can try is to plot the distribution of annotation values in your data vs the 1kg samples, and see how much divergence there is. You could decide to exclude annotations where you see dramatic differences, but still run VQSR with annotations for which the distribution is fairly similar.
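A minimal sketch of that comparison, in pure Python with no external libraries (the file paths, the annotation key, and the summary statistics chosen are all placeholders; in practice you would histogram the values per call set):

```python
# Hypothetical sketch: compare the distribution of one annotation (here QD)
# between your own calls and the padding call set before deciding whether
# to keep that annotation in VQSR.
import gzip
import statistics

def annotation_values(vcf_path, key):
    """Collect float values of the INFO field `key` from a VCF (plain or gzipped)."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    values = []
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip header lines
            info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
            for entry in info.split(";"):
                if entry.startswith(key + "="):
                    try:
                        values.append(float(entry.split("=", 1)[1]))
                    except ValueError:
                        pass  # non-numeric value; ignore
    return values

def summarize(values):
    """Median and spread; compare these (or full histograms) between call sets."""
    return {"n": len(values),
            "median": statistics.median(values),
            "stdev": statistics.pstdev(values)}

# Usage (hypothetical paths):
# print(summarize(annotation_values("my_exome.vcf", "QD")))
# print(summarize(annotation_values("1kg_padding.vcf", "QD")))
```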

Geraldine Van der Auwera, PhD

• USA · Posts: 8 · Member

Hi @Geraldine_VdAuwera
I am working with many whole-genome sequences of dogs (different breeds, so I know we have a lot of polymorphism). I want to use VQSR, but of course we lack the human resources of known true sites like HapMap and Omni. Can I develop my own resource by applying some VERY strict hard filters to define those true sites? We could also repeat the VQSR a couple of times in an iterative way, updating the true variants with those variants having high VQSR scores.
If you think this suggestion is applicable (or at least worth trying), can you suggest some hard filter cutoffs to start with?

P.S. What should a good model plot look like?

Thank you

I am working on documents describing hard filtering, but it will be a little while until they are fully complete. In the meantime, you can start out with the hard filtering recommendations for humans and refine them to fit your data. As for the plots, you can plot each of the annotations you are filtering on and see what their distributions look like. They will be different for different datasets.

-Sheila

• USA · Posts: 8 · Member

@Sheila
Thanks for the response. What I understand from Geraldine's response is that we can apply a set of stringent hard filters to identify a set of variants to use as a "truth" set. For both SNPs and indels, I used these stringent filters: "QD < 6.0 || MQ < 60.0 || MQRankSum < -1.0 || FS > 20.0 || SOR > 2.0 || ReadPosRankSum < -1.0"
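Restated outside JEXL, that expression amounts to the following (a hypothetical Python sketch: a variant is dropped from the candidate truth set if ANY clause is true; the handling of missing annotations is an assumption, since GATK's behaviour for missing values is configurable):

```python
# Hypothetical re-statement of the stringent JEXL filter in Python.
# Annotation keys match the GATK INFO fields quoted in the post.
STRINGENT_CLAUSES = [
    ("QD",             lambda v: v < 6.0),
    ("MQ",             lambda v: v < 60.0),
    ("MQRankSum",      lambda v: v < -1.0),
    ("FS",             lambda v: v > 20.0),
    ("SOR",            lambda v: v > 2.0),
    ("ReadPosRankSum", lambda v: v < -1.0),
]

def passes_stringent(annotations):
    """Return True when the variant survives every clause.
    A missing annotation is treated here as passing that clause
    (an assumption, not necessarily GATK's default behaviour)."""
    return not any(key in annotations and fails(annotations[key])
                   for key, fails in STRINGENT_CLAUSES)
```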

My question is, when we use this set of variants as the truth to train the model, aren't we forcing the model to set high thresholds?

To some extent. It may limit the tool's ability to identify valid "minority" profiles, and if you're using the stringent subset as the sensitivity target, that affects the results you get for a particular level of sensitivity. But you can always lower the threshold and retrieve more variants. And it's still better than flat-out hard filtering.

Geraldine Van der Auwera, PhD

• UK · Posts: 20 · Member

Hello,
Just wanted to confirm some differences I see here compared to when I last looked at the Best Practices 6+ months ago:

• BQSR now includes "Mills_and_1000G_gold_standard.indels"
• VQSR tranche now set to 99.9 (from 99.5) for SNPs
• dbSNP now removed from indel VQSR
• VQSR tranche now set to 99.9 (from 99.0) for Indels

Thanks a lot!

Issue · Github August 22 by Sheila — Issue #1199, closed by vdauwera

@ekanterakis
Hi,

I just asked Geraldine to confirm. She will get back to you soon.

-Sheila

@ekanterakis If I recall correctly those were documentation fixes that were made to harmonize what we had across several documents (there were some mismatches).

Geraldine Van der Auwera, PhD

• UK · Posts: 20 · Member

So those changes reflect the current Best Practices? As a general comment, I'm finding it hard to get a full picture of what a Best Practices workflow looks like. I can find bits and pieces by clicking through various pages, but I'm never sure whether those pages are up to date, because they've been around for a while. It would be much more helpful if there were a Best Practices protocol attached to each release. Is there such a resource? I'm looking for the latest parameters (GATK 3.6) for BQSR, HC and VQSR for whole genomes. Thank you!