Service note: Geraldine is on vacation this week; other members of GSA will be responding to questions, but they have a lot of work besides this, so be aware that responses may be a little slower than usual. Thank you for your patience.

What VQSR training sets / arguments should I use for my specific project?

rpoplinrpoplin Posts: 91GSA Official Member mod

VariantRecalibrator

For use with calls generated by the UnifiedGenotyper and HaplotypeCaller

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

Common, base command line

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   [SPECIFY TRUTH AND TRAINING SETS] \
   [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
   [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle. Arguments for VariantRecalibrator command:

   -percentBad 0.01 -minNumBad 1000 \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP \
   -mode SNP \

Note that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

Important notes about annotations

Some of these annotations might not be the best for your particular dataset. For example, InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included.

Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with hybrid capture datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

Additionally, the UnifiedGenotyper produces a statistic called the HaplotypeScore which should be used for SNPs. This statistic isn't necessary for the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes.

Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline)
  • Use the VQSR with the smaller variant callset but experiment with the precise argument settings (try adding --maxGaussians 4 --percentBad 0.05 to your command line, for example)

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle. Arguments for VariantRecalibrator:

   --maxGaussians 4 -percentBad 0.01 -minNumBad 1000 \
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an DP -an FS -an ReadPosRankSum -an MQRankSum \
   -mode INDEL \

Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included.

ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

For use with calls generated by the UnifiedGenotyper and HaplotypeCaller

Common, base command line

 
 java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
   [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
 

SNP specific recommendations

For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. The default recommendation is to achieve 99.9% sensitivity to the accessible truth sites. Naturally projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity.

   --ts_filter_level 99.9 \
   -mode SNP \

Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. The default recommendation is to achieve 99.9% sensitivity to the accessible truth sites. Naturally projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity.

   --ts_filter_level 99.9 \
   -mode INDEL \
Post edited by Geraldine_VdAuwera on

Comments

  • ericminikelericminikel Posts: 23Member

    I have a very small dataset (12 samples, only ~80K variants) and I found that even using the Mills gold standard indel resource, I was still unable to run VariantRecalibrator on indels. Is there any solution to this?

    My call:

    # create model for INDELs
    java -Xmx8g -jar ~/bin/gatk-2.1-5/GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R gatk-bundle-1.5-hg19/ucsc.hg19.fasta \
        -input variants.vcf \
        --maxGaussians 4 -std 10.0 -percentBad 0.05 \
        -resource:mills,known=true,training=true,truth=true,prior=12.0 gatk-bundle-1.5-hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
        -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff \
        -mode INDEL \
        -recalFile variants.indels.recal \
        -tranchesFile variants.indels.tranches
    

    Some of the output:

    INFO  19:18:00,552 VariantRecalibratorEngine - Convergence after 12 iterations! 
    INFO  19:18:00,554 VariantRecalibratorEngine - Evaluating full set of 5570 variants... 
    INFO  19:18:01,275 GATKRunReport - Uploaded run statistics report to AWS S3 
    
    ##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Hi Eric,

    Have you looked at the VQSR doc in Methods & Workflows? At the end there is a small collection of FAQs, one question being that of small datasets:

    Can I use the variant quality score recalibrator with my small sequencing experiment? This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties. One piece of advice is to turn down the number of Gaussians used during training and to turn up the number of variants that are used to train the negative model. This can be accomplished by adding --maxGaussians 4 --percentBad 0.05 to your command line.

    Let me know if that little tweak still doesn't help.

    Geraldine Van der Auwera, PhD

  • akbariakbari Posts: 6Member

    Where can I find the raw vcf for 1000G exomes to be used in combination with my small number of exomes for variantrecalibrator? If the raw vcfs are not available for 1000G exomes and I should call the variants from the BAM files myself, can I use the 1000G exomes' vcf called by Unifiedgenotyper with our own samples' vcfs called by haplotypecaller for variantrecalibrator? Thanks.

  • ebanksebanks Posts: 478GSA Official Member mod

    The recommendations say "Add additional samples for variant calling" not for recalibration. In other words you need to do the calling simultaneously with your own samples plus the 1000 Genomes bams. You cannot just merge together VCFs.

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • akbariakbari Posts: 6Member

    Thanks for your answer. But it might not be practical to call variants for extra 20 1000G samples (to have the 30 in total) each time specially with haplotypecaller. It will take forever.

  • ebanksebanks Posts: 478GSA Official Member mod

    We are actively working to speed up the run time of the Haplotype Caller. In the meantime, if run time is a problem for you, then you should use the Unified Genotyper.

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    In fact the HaplotypeCaller's runtime has been significantly improved already in the new release -- check out this overview:

    http://www.broadinstitute.org/gatk/guide/version-history

    Geraldine Van der Auwera, PhD

  • akbariakbari Posts: 6Member

    Thanks for your note. I just downloaded the 2.2 release and I am running the haplotypecaller now and I can attest that it is at least 2 fold faster on my system. But what amazed me the most is the multi threading in baserecalibrator. I ran it with 8 cores and it is amazingly fast. Thumbs up for the team who worked on these! Do you know if multi threading could be used in this release for indelrealigner as well?

    About using Unified Genotyper, I should say that I can not go back to it after using haplotypecaller and seeing the large indels it picks and UG misses them, specially after confirming them with Sanger. I just feel guilty if I want to use UG again!

    Geraldine_VdAuwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    The team is very pleased to hear that you like the improvements :)

    Unfortunately you can't use multi-threading with the IndelRealigner at this time, but you can parallelize it with scatter-gather (Queue). We'll have a new document out very soon detailing the parallelism options available for all the main tools.

    Geraldine Van der Auwera, PhD

  • Mark_DePristoMark_DePristo Posts: 132Administrator, GSA Official Member admin

    Hi Akbari, great to hear you are enjoying GATK 2.2! Any chance you could share some screenshots of those indels with us?

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • ik5ik5 Posts: 2Member

    Hi, I'm trying to use VariantRecalibrator on an exome dataset. I reduced the bams and created the vcf files using UnifiedGenotyper. The SNP calling works great but the indel calling gives an error. I'm using the Mills_and_1000G_gold_standard.indels.b37.vcf that was recomended.

    But it gives me the following error " ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations."

    Although i checked and the annotationts are in my files i used VariantAnnotator to redo them. But i'm still getting the same error running VariantRecalibrator on indels(SNP mode works fine).

    Has anyone else had problems using Mills_and_1000G_gold_standard.indels.b37.vcf for indel calling? thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Hi there, can you tell me what version you are using and what is your command line?

    Geraldine Van der Auwera, PhD

  • ik5ik5 Posts: 2Member

    Afcourse im using the latest version GenomeAnalysisTK-2.2-8

    i've added my code for indel and SNPs below.

    Indel calling:

    bsub -G team -q long -R'select[mem>8000] rusage[mem=8000]' -M8000000  -e log6.e -o log6.o java -Xmx8g -jar /ik5/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar \
       -R /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/human_g1k_v37.fasta \
       -T VariantRecalibrator \
       -input /ik5/BAMS/output.reduced.UnifiedGenotyper.raw.snps.indels_chr13.vcf \
       --maxGaussians 4 -std 10.0 -percentBad 0.12 \
       -resource:mills,known=true,training=true,truth=true,prior=12.0 /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/Mills_and_1000G_gold_standard.indels.b37.vcf \
       -mode INDEL \
       -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff \
       -L 13 \
       -rscriptFile /ik5/output/reduced.UnifiedGenotyper.Indel.chr13.plots.R \
       -recalFile /ik5/output/reduced.UnifiedGenotyper.Indel.chr13.recal \
       -tranchesFile /ik5/output/reduced.UnifiedGenotyper.Indel.chr13.tranches
    

    SNP calling:

    bsub -G team -q long -R'select[mem>8000] rusage[mem=8000]' -M8000000  -e log7.e -o log7.o java -Xmx8g -jar /ik5/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar \
       -R /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/human_g1k_v37.fasta \
       -T VariantRecalibrator \
       -input /ik5/BAMs/output.reduced.UnifiedGenotyper.raw.snps.indels_chr13.vcf \
       --maxGaussians 6 \
       -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/hapmap_3.3.b37.vcf \
       -resource:omni,known=false,training=true,truth=false,prior=12.0 /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/1000G_omni2.5.b37.vcf \
       -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/dbsnp_137.b37.vcf \
       -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
       -mode SNP \
       -L 13 \
       -rscriptFile /ik5/output/ASDFI.reduced.UnifiedGenotyper.SNPs.chr13.plots.R \
       -recalFile /ik5/output/ASDFI.reduced.UnifiedGenotyper.SNPs.chr13.recal \
       -tranchesFile /ik5/output/ASDFI.reduced.UnifiedGenotyper.SNPs.chr13.tranches
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Well, the most likely explanation is that there is no overlap between your training dataset and your callset. This can happen if your exome dataset is quite small. You may need to use hard filtering instead of VQSR; see our Best Practices recommendations for details.

    Geraldine Van der Auwera, PhD

  • lucdhlucdh Posts: 4Member

    Hi, concerning the recommendation to include > 30 bam files in an exome capture experiment: does the selection of those extra samples (from the 1000 Genomes Project) require special attention when analysing a single sample, e.g., in a diagnostic setting? In particular, is there any danger of (1) introducing a bias in the variant calling by including a set of bam files that is biased towards a particular population ; or (2) suppressing de novo mutations and other events that occur in the sample of interest but not in the extended (the bigger the better?) set?

    Post edited by lucdh on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    To be honest, we're not sure because we haven't had the opportunity to test this extensively on different datasets. That recommendation stems from an experiment we did over a year ago, in which we found that it was better to add 30 samples from 1000 Genomes to the mix than to call one exome alone. The extra samples were not selected with any particular criteria. But we would have to do more experiments before we can say anything more concrete. In the meantime I would recommend that you try it that way (adding samples without special attention to the selection) and then also with hard filters, and then decide for yourself which callset looks less biased. Please feel free to share your observations on the forum, as I'm sure others in the community would be interested to benefit from your experience.

    Geraldine Van der Auwera, PhD

  • lucdhlucdh Posts: 4Member

    Hi Geraldine, Thanks for your answer, good to know this issue is unresolved. We will see if we can learn something from experiments with in house and 1000 Genomes mixes and share whatever we find. In addition, it would be very helpful for setting up the experiments if GATK-developers, based on their understanding of the algorithms, could share their ideas about the steps where the extra samples could make a difference and how they might impact the outcome. --luc

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Hi Luc,

    What it comes down to is VQSR's machine learning model. The principles are explained in the documentation, but I agree it's difficult to grasp the deeper implications. It's also something that's difficult to explain in written texts. For that reason, we are planning to hold live Q&A sessions in which users like yourself will be able to ask questions and have the developers answer them in real-time. We're hoping this might help answer these kinds of questions.

    Geraldine Van der Auwera, PhD

  • jerryliu2005jerryliu2005 Posts: 1Member

    Hi, I'm trying to call variants for exomes which are from pooled samples. We have 40 pools which each is from a pool of exomes from 10 individuals. I used UG with -ploidy 20 parameter to call each pool and run hard-filtering. I'm seeking advice on whether it is better to merge all recalibrated BAMs from all pools and use UG to call SNPs using "-ploidy 800". Then I can use VQSR for filtering, right? Any suggestions?

    Thanks a lot! Jerry

  • delangeldelangel Posts: 61GSA Official Member mod

    Don't call with -ploidy 800! Most probably the calls will take forever and your jobs will die from lack of memory. You should definitely call all BAMs together but do multi-sample calling and keeping the -ploidy 20 parameter which is the number of chromosomes per pool. You will then get a single vcf with all your samples but with separate genotypes for each pool (in the pooled case, the "genotype" will be the most likely # of alt alleles in each pool). I ASSUME you can use VQSR but honestly we've never tried it with pools of data so we have no idea what kind of results it will produce, nor what the best practices are in this case.

  • TechnicalVaultTechnicalVault Posts: 29Member

    On: http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr#latest You recommend the use of the HRun annotation:

    How do I know which annotations to use for my data?

    The five annotation values provided in the command lines above (QD, HaplotypeScore, MQRankSum, ReadPosRankSum, and HRun) have been show to give good results for a variety of data types.

    But HRun is not included in any of the recommendations here. Is one resource out of date? Or are there specific datasets where HRun is appropriate and others where it's not.

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Yes, the doc is a little out of date; we no longer use HRun.

    Geraldine Van der Auwera, PhD

  • cgecge Posts: 3Member

    Hi, I`m finding with the INDELS the same error that IK5 found. With SNPs I have not many problems. My vcf file has been generated using UnifiedGenotyper from 42 BAM files, from a target resequencing experiment of a single gene. What are the ranges values for maxGaussian, percendBAd and std that could I try to set? Thank you very much.

    Here are the command:

    java -Xmx24g -jar ${GATK} \ -T VariantRecalibrator \ -R $REF \ -input all_bis_mark_dup.indel.rc.bam.raw.vcf \ -resource:mills,known=true,training=true,truth=true,prior=12.0 $ref_dir/Mills_and_1000G_gold_standard.indels.b37.vcf \ -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff \ --maxGaussians 4 -percentBad 0.05 -std 10 \ -mode INDEL \ -L 13 \ -recalFile $final_dir/output_INDEL.recal \ -tranchesFile $final_dir/output_INDEL.tranches \ -rscriptFile $final_dir/output.plots_INDEL.R

    Here are the error message:

    INFO 14:30:24,217 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:30:24,220 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.2-16-g9f648cb, Compiled 2012/12/04 03:46:58 INFO 14:30:24,220 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 14:30:24,220 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 14:30:24,226 HelpFormatter - Program Args: -T VariantRecalibrator -R /users04/ipiras/ref/Homo_sapiens.GRCh37.62.fa -input all_bis_mark_dup.indel.rc.bam.raw.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 /ref/Mills_and_1000G_gold_standard.indels.b37.vcf -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff --maxGaussians 1 -percentBad 0.000001 -std 10 -mode INDEL -L 13 -recalFile /final_output/output_INDEL.recal -tranchesFile /final_output/output_INDEL.tranches -rscriptFile /final_output/output.plots_INDEL.R INFO 14:30:24,227 HelpFormatter - Date/Time: 2013/02/26 14:30:24 INFO 14:30:24,227 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:30:24,227 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:30:24,250 ArgumentTypeDescriptor - Dynamically determined type of all_bis_mark_dup.indel.rc.bam.raw.vcf to be VCF INFO 14:30:24,269 ArgumentTypeDescriptor - Dynamically determined type of /users04/ipiras/ref/Mills_and_1000G_gold_standard.indels.b37.vcf to be VCF INFO 14:30:24,284 GenomeAnalysisEngine - Strictness is SILENT INFO 14:30:24,453 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 1000
    INFO 14:30:24,585 RMDTrackBuilder - Loading Tribble index from disk for file all_bis_mark_dup.indel.rc.bam.raw.vcf INFO 14:30:24,689 RMDTrackBuilder - Loading Tribble index from disk for file /users04/ipiras/ref/Mills_and_1000G_gold_standard.indels.b37.vcf WARN 14:30:24,793 VCFStandardHeaderLines$Standards - Repairing standard header line for field GQ because -- type disagree; header has Float but standard is Integer INFO 14:30:24,808 GenomeAnalysisEngine - Processing 115169878 bp from intervals INFO 14:30:24,824 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 14:30:24,824 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining WARN 14:30:24,888 Utils - ******************************************************************************** WARN 14:30:24,889 Utils - * WARNING: WARN 14:30:24,889 Utils - * WARN 14:30:24,889 Utils - * Rscript not found in environment path. WARN 14:30:24,889 Utils - * /huentelman/057_KIBRA_RD_Nextera_6_24_11/FASTQs/Project_KIBRA_RD_Nexte WARN 14:30:24,889 Utils - * a/final_output/output.plots_INDEL.R will be generated but PDF plots WARN 14:30:24,890 Utils - * will not. WARN 14:30:24,890 Utils - ******************************************************************************** INFO 14:30:24,894 TrainingSet - Found mills track: Known = true Training = true Truth = true Prior = Q12.0 INFO 14:30:28,472 VariantDataManager - QD: mean = NaN standard deviation = NaN INFO 14:30:29,705 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.2-16-g9f648cb):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator
    ERROR ------------------------------------------------------------------------------------------
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    The error you're reporting here has nothing to do with value ranges; it seems you're missing annotations. See the error message for more information.

    Geraldine Van der Auwera, PhD

  • TechnicalVaultTechnicalVault Posts: 29Member

    Given that in 2.4 the Haplotype Score annotation no longer applies to indels. Could we get an updated recommendation for the INDEL command line please? Thank you

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    For now, just remove the HS annotation from the command line; the rest still applies in the same way.

    Geraldine Van der Auwera, PhD

  • barakbarak Posts: 1Member

    Hi I'm analyzing a single sample, high coverage whole genome seq data (Illumina at ~ x60). Do you have any recommendations for variant re-calibration here? Is it appropriate for this scenario? I did variant calling using this single sample with the unified genotyper (Haplotype caller is still running ...)

    thanks Barak

  • bpowbpow Posts: 5Member

    I could be missing something, but the documentation above refers to hapmap_3.3.b37.sites.vcf and 1000G_omni2.5.b37.sites.vcf and suggests:

    Using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

    I think the files in the GATK resource bundle hapmap_3.3.b37.vcf.gz and 1000G_omni2.5.b37.vcf.gz are sites-only files, but not named as such. If so, the documentation would be clearer if it used the same file names as the resource bundle.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    That's a fair point, we'll try to homogenize the names between the docs and the bundle files. Thanks for pointing this out.

    Geraldine Van der Auwera, PhD

  • rpoplinrpoplin Posts: 91GSA Official Member mod

    Hi all, this page was updated in April 2013 to reflect the results of many experiments run at the Broad to provide the best possible recommendations to the community for how best to use the VQSR in conjunction with the current version of the UnifiedGenotyper and HaplotypeCaller.

  • vipingjovipingjo Posts: 0Member

    Hi, the file mentioned above '1000G_phase1.snps.high_confidence.vcf' was not found at ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.3/b37. Where can I found it?

  • egafniegafni Posts: 6Member

    I am also unable to locate the 1000G phase1 high_confidence file

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Hi folks,

    The file is not available yet, sorry. This is due to a slight timing issue between documentation updates and the release process -- the new file will be in the new version of the bundle which will be released with version 2.5 within a few days.

    Geraldine Van der Auwera, PhD

  • redzengenoistredzengenoist Posts: 7Member

    Hi Geraldine,

    I also get NaN LOD: `

    ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)`

    But I'm testing on 1 sample which is WGS at 4x, is it likely that the set just isn't big enough?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Hi @redzengenoist,

    Just one sample is a little short, I'm afraid. You should consider supplementing it with some other callset as described in the Best Practices recommendations; we typically use calls from the 1000 Genomes Project.

    Geraldine Van der Auwera, PhD

  • bandersoCHGbandersoCHG Posts: 1Member

    Hello,

    I am working with the newly suggested 1000G_phase1.snps.high_confidence.vcf variants to train the error model for my whole exome sequencing experiment. However, when I attempt to run VariantRecalibrator I get the following error:

    INFO 09:43:24,634 HelpFormatter - Date/Time: 2013/05/09 09:43:24 INFO 09:43:24,634 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:43:24,634 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:43:24,647 ArgumentTypeDescriptor - Dynamically determined type of snpsUG_NTDplusLGMD_allsamples.raw.vcf to be VCF INFO 09:43:24,650 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/hapmap_3.3.b37.sites.vcf to be VCF INFO 09:43:24,651 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/1000G_omni2.5.b37.sites.vcf to be VCF INFO 09:43:24,652 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf to be VCF INFO 09:43:24,654 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/dbsnp_137.b37.vcf to be VCF INFO 09:43:25,355 GenomeAnalysisEngine - Strictness is SILENT INFO 09:43:25,463 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 09:43:25,490 RMDTrackBuilder - Loading Tribble index from disk for file snpsUG_NTDplusLGMD_allsamples.raw.vcf INFO 09:43:25,667 RMDTrackBuilder - Loading Tribble index from disk for file /chg/mitosis1/ngs_references/hapmap_3.3.b37.sites.vcf INFO 09:43:25,685 RMDTrackBuilder - Loading Tribble index from disk for file /chg/mitosis1/ngs_references/1000G_omni2.5.b37.sites.vcf INFO 09:43:25,724 RMDTrackBuilder - Creating Tribble index in memory for file /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf INFO 09:47:23,671 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Input files /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf and reference have incompatible contigs: No overlapping contigs found.
    ERROR /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX]
    ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]
    ERROR ------------------------------------------------------------------------------------------

    So, is there incosistency between the high_confidence variant list and the reference? I am using human_g1k_v37.fasta

    Thanks, any help would be greatly appreciated.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Hi @bandersoCHG,

    You are using the wrong reference -- you must use the same reference as your reads were aligned against. Based on the contig names listed in the error message (and the file name of the variants) you should be using the hg19 reference. You can get it from our resource bundle if you don't have it already.

    Geraldine Van der Auwera, PhD

  • kyasunokyasuno Posts: 14Member

    Hi, It seems that the documentation has been changed dramatically (e.g. exome specific recommendations disappeared and it apparently requires separate VQSR applications to SNPs and INDELs, respectively, even for HaplotypeCaller [how can we handle other types of variants detected by HC?]).
    Any help to interpret these changes would be greatly appreciated. Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Hi @kyasuno,

    The documentation was updated after we re-evaluated the results we were getting with the latest versions of the tools. The changes are too sweeping for us to be able to provide a blow-by-blow explanation of why these changes were made. Ultimately these should simply be taken as our best recommendations for using the tools in their current state. We'd be happy to answer any specific questions you may have about specific points of course.

    Geraldine Van der Auwera, PhD

  • kyasunokyasuno Posts: 14Member

    Hi Geraldine,

    Sorry for my vague comments.

    I'm most interested in how MNPs and block substitutions will be handled if we need to apply VQSR to SNPs and INDELs separately for HaplotypeCaller's call set. In the previous version of recommendation, all the variant types were supposed to be analyzed together, so I assumed that complex variants like MNPs and block substitusions were analyzed properly.

    Secondly, it would be great if I can learn some rationale behind the condition "-minNumBad 1000" being added. Does it tell us that training for bad variants using 1000 of worst scoring sites would be sufficient and default value of 2500 is not sensible (or is redundant)? When a specified value of percentBadVariants results in, say, 1500 sites (> 1000 of minNumBad), does VQSR use 1500 sites for training?

    Thanks

  • rpoplinrpoplin Posts: 91GSA Official Member mod

    Hi there,

    1.) The names of the modes are a little ambiguous. The SNP mode handles SNP variants while the INDEL mode does everything else-- including in the model the MNP and complex variants. In our tests we achieve better results by breaking out the SNPs from the rest of the variants and modeling them separately.

    2.) The number of negative training examples is the maximum between the flat number specified by -minNumBad and the number calculated with -percentBadVariants. The value was lowered in the updated best practice recommendations because our testing revealed that there are much fewer false positive raw calls being made in the newer pipeline.

    Cheers,

Sign In or Register to comment.