Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

What VQSR training sets / arguments should I use for my specific project?

rpoplin (GSA Member, mod)
edited April 15 in FAQs

This document describes the resource datasets and arguments to use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.
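
As a minimal sketch of how such a bootstrapped truth/training set could be supplied to VariantRecalibrator (the tag and file name here are hypothetical; the -resource syntax is the same one used in the commands further down, and prior=10.0 reflects the Q10 starting value suggested above):

   -resource:bootstrap,known=false,training=true,truth=true,prior=10.0 my_species.highconf.snps.vcf \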

Resources for SNPs

  • True sites training resource: HapMap

    This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

    This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G
    This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP
    This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

Resources for Indels

  • Known and true sites training resource: Mills
    This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).
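
For reference, these priors are phred-scaled probabilities: a prior of Q corresponds to a probability of being a true site of 1 - 10^(-Q/10). For example, prior=15.0 gives 1 - 10^(-1.5) ≈ 0.9684, which is the 96.84% quoted for HapMap above; likewise Q12 ≈ 93.69% and Q2 ≈ 36.90%.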

VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement over previously recommended protocols is that hand filters no longer need to be applied at any point in the process: all filtering criteria are learned from the data itself.

Common, base command line

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   [SPECIFY TRUTH AND TRAINING SETS] \
   [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
   [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.

Arguments for VariantRecalibrator command:

   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an DP -an InbreedingCoeff \
   -mode SNP \
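
Putting the common base command line and these SNP-specific arguments together, a complete invocation would look something like the following (paths are placeholders that should point to your own reference, input and resource bundle files):

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an DP -an InbreedingCoeff \
   -mode SNP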

Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration (see below for details of why).

Note also that, for the above to work, the input VCF needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are missing, VariantAnnotator needs to be run first so that VariantRecalibrator can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

Important notes about annotations

Some of these annotations might not be the best for your particular dataset.

Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

Additionally, the UnifiedGenotyper produces a statistic called the HaplotypeScore which should be used for SNPs. This statistic isn't necessary for the HaplotypeCaller because that math is already built into the likelihood function itself when calling full haplotypes.

The InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples please omit this annotation from the command line.
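
As an illustration of these notes, an exome project with fewer than 10 samples might trim the SNP annotation list shown above down to something like the following (dropping DP and InbreedingCoeff, per the two caveats just described):

   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS \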

Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline)
  • Use the VQSR with the smaller variant callset but experiment with the precise argument settings (try adding --maxGaussians 4 to your command line, for example)

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

Arguments for VariantRecalibrator:

   --maxGaussians 4 \
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an DP -an FS -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
   -mode INDEL \
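
As with SNPs, combining the common base command line with these indel-specific arguments gives a complete invocation along these lines (paths are placeholders):

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   --maxGaussians 4 \
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
   -an QD -an DP -an FS -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
   -mode INDEL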

Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

Common, base command line

 
 java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
   [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
 

SNP specific recommendations

For SNPs we use HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.5 \
   -mode SNP \
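
For example, a complete ApplyRecalibration invocation for SNPs, assembled from the base command line above and these settings (paths are placeholders), would be:

java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   --ts_filter_level 99.5 \
   -mode SNP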

Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.0 \
   -mode INDEL \
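
To tie the two passes together (recall from the top of this document that VQSR is run twice in succession), a common pattern is to model and filter SNPs first and then feed the SNP-recalibrated VCF into the indel pass, so that the final VCF carries both sets of filters. A rough sketch, with placeholder file names:

# Pass 1: SNPs
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta -input raw.vcf \
   [SNP resources and annotations as above] -mode SNP \
   -recalFile snp.recal -tranchesFile snp.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R ref.fasta -input raw.vcf \
   -recalFile snp.recal -tranchesFile snp.tranches --ts_filter_level 99.5 -mode SNP \
   -o recalibrated_snps.raw_indels.vcf

# Pass 2: indels, run on the output of pass 1
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta -input recalibrated_snps.raw_indels.vcf \
   [indel resources and annotations as above] -mode INDEL \
   -recalFile indel.recal -tranchesFile indel.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R ref.fasta -input recalibrated_snps.raw_indels.vcf \
   -recalFile indel.recal -tranchesFile indel.tranches --ts_filter_level 99.0 -mode INDEL \
   -o recalibrated.snps_indels.vcf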

Comments

  • ericminikel (Member)

    I have a very small dataset (12 samples, only ~80K variants) and I found that even using the Mills gold standard indel resource, I was still unable to run VariantRecalibrator on indels. Is there any solution to this?

    My call:

    # create model for INDELs
    java -Xmx8g -jar ~/bin/gatk-2.1-5/GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R gatk-bundle-1.5-hg19/ucsc.hg19.fasta \
        -input variants.vcf \
        --maxGaussians 4 -std 10.0 -percentBad 0.05 \
        -resource:mills,known=true,training=true,truth=true,prior=12.0 gatk-bundle-1.5-hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
        -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff \
        -mode INDEL \
        -recalFile variants.indels.recal \
        -tranchesFile variants.indels.tranches
    

    Some of the output:

    INFO  19:18:00,552 VariantRecalibratorEngine - Convergence after 12 iterations! 
    INFO  19:18:00,554 VariantRecalibratorEngine - Evaluating full set of 5570 variants... 
    INFO  19:18:01,275 GATKRunReport - Uploaded run statistics report to AWS S3 
    
    ##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
    
  • Geraldine_VdAuwera (Administrator, GSA Member)

    Hi Eric,

    Have you looked at the VQSR doc in Methods & Workflows? At the end there is a small collection of FAQs, one question being that of small datasets:

    Can I use the variant quality score recalibrator with my small sequencing experiment? This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties. One piece of advice is to turn down the number of Gaussians used during training and to turn up the number of variants that are used to train the negative model. This can be accomplished by adding --maxGaussians 4 --percentBad 0.05 to your command line.

    Let me know if that little tweak still doesn't help.

    Geraldine Van der Auwera, PhD

  • akbari (Member)

    Where can I find the raw VCF for 1000G exomes to be used in combination with my small number of exomes for VariantRecalibrator? If the raw VCFs are not available for 1000G exomes and I should call the variants from the BAM files myself, can I use the 1000G exomes' VCF called by UnifiedGenotyper with our own samples' VCFs called by HaplotypeCaller for VariantRecalibrator? Thanks.

  • ebanks (GSA Member, mod)

    The recommendations say "Add additional samples for variant calling" not for recalibration. In other words you need to do the calling simultaneously with your own samples plus the 1000 Genomes bams. You cannot just merge together VCFs.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT
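
    To sketch what "calling simultaneously" looks like in practice (the BAM file names below are hypothetical), the 1000 Genomes BAMs are simply passed to the caller alongside your own, producing one multi-sample VCF that then goes into VQSR:

    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
       -R human_g1k_v37.fasta \
       -I my_sample1.bam -I my_sample2.bam \
       -I 1000G_exome_sample1.bam -I 1000G_exome_sample2.bam \
       -o joint.raw.vcf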

  • akbari (Member)

    Thanks for your answer. But it might not be practical to call variants for an extra 20 1000G samples (to reach 30 in total) each time, especially with HaplotypeCaller. It will take forever.

  • ebanks (GSA Member, mod)

    We are actively working to speed up the run time of the Haplotype Caller. In the meantime, if run time is a problem for you, then you should use the Unified Genotyper.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Geraldine_VdAuwera (Administrator, GSA Member)

    In fact the HaplotypeCaller's runtime has been significantly improved already in the new release -- check out this overview:

    http://www.broadinstitute.org/gatk/guide/version-history

    Geraldine Van der Auwera, PhD

  • akbari (Member)

    Thanks for your note. I just downloaded the 2.2 release and I am running the HaplotypeCaller now, and I can attest that it is at least 2-fold faster on my system. But what amazed me the most is the multi-threading in BaseRecalibrator. I ran it with 8 cores and it is amazingly fast. Thumbs up for the team who worked on these! Do you know if multi-threading could be used in this release for IndelRealigner as well?

    About using the UnifiedGenotyper, I should say that I cannot go back to it after using the HaplotypeCaller and seeing the large indels it picks up that UG misses, especially after confirming them with Sanger. I just feel guilty if I want to use UG again!

  • Geraldine_VdAuwera (Administrator, GSA Member)

    The team is very pleased to hear that you like the improvements :)

    Unfortunately you can't use multi-threading with the IndelRealigner at this time, but you can parallelize it with scatter-gather (Queue). We'll have a new document out very soon detailing the parallelism options available for all the main tools.

    Geraldine Van der Auwera, PhD

  • Mark_DePristo (Administrator, GSA Member)

    Hi Akbari, great to hear you are enjoying GATK 2.2! Any chance you could share some screenshots of those indels with us?

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • ik5 (Member)

    Hi, I'm trying to use VariantRecalibrator on an exome dataset. I reduced the BAMs and created the VCF files using UnifiedGenotyper. The SNP calling works great but the indel calling gives an error. I'm using the Mills_and_1000G_gold_standard.indels.b37.vcf that was recommended.

    But it gives me the following error " ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations."

    Although I checked and the annotations are in my files, I used VariantAnnotator to redo them. But I'm still getting the same error running VariantRecalibrator on indels (SNP mode works fine).

    Has anyone else had problems using Mills_and_1000G_gold_standard.indels.b37.vcf for indel calling? Thanks.

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Hi there, can you tell me what version you are using and what is your command line?

    Geraldine Van der Auwera, PhD

  • ik5 (Member)

    Of course, I'm using the latest version, GenomeAnalysisTK-2.2-8.

    I've added my code for indels and SNPs below.

    Indel calling:

    bsub -G team -q long -R'select[mem>8000] rusage[mem=8000]' -M8000000  -e log6.e -o log6.o java -Xmx8g -jar /ik5/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar \
       -R /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/human_g1k_v37.fasta \
       -T VariantRecalibrator \
       -input /ik5/BAMS/output.reduced.UnifiedGenotyper.raw.snps.indels_chr13.vcf \
       --maxGaussians 4 -std 10.0 -percentBad 0.12 \
       -resource:mills,known=true,training=true,truth=true,prior=12.0 /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/Mills_and_1000G_gold_standard.indels.b37.vcf \
       -mode INDEL \
       -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff \
       -L 13 \
       -rscriptFile /ik5/output/reduced.UnifiedGenotyper.Indel.chr13.plots.R \
       -recalFile /ik5/output/reduced.UnifiedGenotyper.Indel.chr13.recal \
       -tranchesFile /ik5/output/reduced.UnifiedGenotyper.Indel.chr13.tranches
    

    SNP calling:

    bsub -G team -q long -R'select[mem>8000] rusage[mem=8000]' -M8000000  -e log7.e -o log7.o java -Xmx8g -jar /ik5/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar \
       -R /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/human_g1k_v37.fasta \
       -T VariantRecalibrator \
       -input /ik5/BAMs/output.reduced.UnifiedGenotyper.raw.snps.indels_chr13.vcf \
       --maxGaussians 6 \
       -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/hapmap_3.3.b37.vcf \
       -resource:omni,known=false,training=true,truth=false,prior=12.0 /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/1000G_omni2.5.b37.vcf \
       -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 /ik5/GenomeAnalysisTK-2.2-8-gec077cd/resources/dbsnp_137.b37.vcf \
       -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
       -mode SNP \
       -L 13 \
       -rscriptFile /ik5/output/ASDFI.reduced.UnifiedGenotyper.SNPs.chr13.plots.R \
       -recalFile /ik5/output/ASDFI.reduced.UnifiedGenotyper.SNPs.chr13.recal \
       -tranchesFile /ik5/output/ASDFI.reduced.UnifiedGenotyper.SNPs.chr13.tranches
    
  • Geraldine_VdAuwera (Administrator, GSA Member)

    Well, the most likely explanation is that there is no overlap between your training dataset and your callset. This can happen if your exome dataset is quite small. You may need to use hard filtering instead of VQSR; see our Best Practices recommendations for details.

    Geraldine Van der Auwera, PhD

  • lucdh (Member)
    edited January 2013

    Hi, concerning the recommendation to include > 30 bam files in an exome capture experiment: does the selection of those extra samples (from the 1000 Genomes Project) require special attention when analysing a single sample, e.g., in a diagnostic setting? In particular, is there any danger of (1) introducing a bias in the variant calling by including a set of bam files that is biased towards a particular population ; or (2) suppressing de novo mutations and other events that occur in the sample of interest but not in the extended (the bigger the better?) set?

  • Geraldine_VdAuwera (Administrator, GSA Member)

    To be honest, we're not sure because we haven't had the opportunity to test this extensively on different datasets. That recommendation stems from an experiment we did over a year ago, in which we found that it was better to add 30 samples from 1000 Genomes to the mix than to call one exome alone. The extra samples were not selected with any particular criteria. But we would have to do more experiments before we can say anything more concrete. In the meantime I would recommend that you try it that way (adding samples without special attention to the selection) and then also with hard filters, and then decide for yourself which callset looks less biased. Please feel free to share your observations on the forum, as I'm sure others in the community would be interested to benefit from your experience.

    Geraldine Van der Auwera, PhD

  • lucdh (Member)

    Hi Geraldine, Thanks for your answer, good to know this issue is unresolved. We will see if we can learn something from experiments with in house and 1000 Genomes mixes and share whatever we find. In addition, it would be very helpful for setting up the experiments if GATK-developers, based on their understanding of the algorithms, could share their ideas about the steps where the extra samples could make a difference and how they might impact the outcome. --luc

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Hi Luc,

    What it comes down to is VQSR's machine learning model. The principles are explained in the documentation, but I agree it's difficult to grasp the deeper implications. It's also something that's difficult to explain in written texts. For that reason, we are planning to hold live Q&A sessions in which users like yourself will be able to ask questions and have the developers answer them in real-time. We're hoping this might help answer these kinds of questions.

    Geraldine Van der Auwera, PhD

  • jerryliu2005 (Member)

    Hi, I'm trying to call variants for exomes which are from pooled samples. We have 40 pools, each of which is a pool of exomes from 10 individuals. I used UG with the -ploidy 20 parameter to call each pool and ran hard-filtering. I'm seeking advice on whether it is better to merge all recalibrated BAMs from all pools and use UG to call SNPs using "-ploidy 800". Then I can use VQSR for filtering, right? Any suggestions?

    Thanks a lot! Jerry

  • delangel (GSA Member, mod)

    Don't call with -ploidy 800! Most probably the calls will take forever and your jobs will die from lack of memory. You should definitely call all BAMs together, but do multi-sample calling and keep the -ploidy 20 parameter, which is the number of chromosomes per pool. You will then get a single VCF with all your samples but with separate genotypes for each pool (in the pooled case, the "genotype" will be the most likely # of alt alleles in each pool). I ASSUME you can use VQSR, but honestly we've never tried it with pools of data so we have no idea what kind of results it will produce, nor what the best practices are in this case.
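
    A rough sketch of that multi-sample pooled call (the pool BAM names are hypothetical; -ploidy stays at 20, the number of chromosomes per pool):

    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
       -R human_g1k_v37.fasta \
       -I pool01.bam -I pool02.bam [one -I per pool, 40 in total] \
       -ploidy 20 \
       -o all_pools.raw.vcf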

  • TechnicalVault (Sanger, Cambridge, UK; Member)

    On: http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr#latest You recommend the use of the HRun annotation:

    How do I know which annotations to use for my data?

    The five annotation values provided in the command lines above (QD, HaplotypeScore, MQRankSum, ReadPosRankSum, and HRun) have been shown to give good results for a variety of data types.

    But HRun is not included in any of the recommendations here. Is one resource out of date? Or are there specific datasets where HRun is appropriate and others where it's not?

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Yes, the doc is a little out of date; we no longer use HRun.

    Geraldine Van der Auwera, PhD

  • cge (Member)

    Hi, I'm running into the same error with INDELs that ik5 found. With SNPs I don't have many problems. My VCF file was generated using UnifiedGenotyper from 42 BAM files, from a targeted resequencing experiment of a single gene. What value ranges for maxGaussians, percentBad and std could I try to set? Thank you very much.

    Here are the command:

    java -Xmx24g -jar ${GATK} \
        -T VariantRecalibrator \
        -R $REF \
        -input all_bis_mark_dup.indel.rc.bam.raw.vcf \
        -resource:mills,known=true,training=true,truth=true,prior=12.0 $ref_dir/Mills_and_1000G_gold_standard.indels.b37.vcf \
        -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff \
        --maxGaussians 4 -percentBad 0.05 -std 10 \
        -mode INDEL \
        -L 13 \
        -recalFile $final_dir/output_INDEL.recal \
        -tranchesFile $final_dir/output_INDEL.tranches \
        -rscriptFile $final_dir/output.plots_INDEL.R

    Here are the error message:

    INFO 14:30:24,217 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:30:24,220 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.2-16-g9f648cb, Compiled 2012/12/04 03:46:58 INFO 14:30:24,220 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 14:30:24,220 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 14:30:24,226 HelpFormatter - Program Args: -T VariantRecalibrator -R /users04/ipiras/ref/Homo_sapiens.GRCh37.62.fa -input all_bis_mark_dup.indel.rc.bam.raw.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 /ref/Mills_and_1000G_gold_standard.indels.b37.vcf -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff --maxGaussians 1 -percentBad 0.000001 -std 10 -mode INDEL -L 13 -recalFile /final_output/output_INDEL.recal -tranchesFile /final_output/output_INDEL.tranches -rscriptFile /final_output/output.plots_INDEL.R INFO 14:30:24,227 HelpFormatter - Date/Time: 2013/02/26 14:30:24 INFO 14:30:24,227 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:30:24,227 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:30:24,250 ArgumentTypeDescriptor - Dynamically determined type of all_bis_mark_dup.indel.rc.bam.raw.vcf to be VCF INFO 14:30:24,269 ArgumentTypeDescriptor - Dynamically determined type of /users04/ipiras/ref/Mills_and_1000G_gold_standard.indels.b37.vcf to be VCF INFO 14:30:24,284 GenomeAnalysisEngine - Strictness is SILENT INFO 14:30:24,453 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 1000
    INFO 14:30:24,585 RMDTrackBuilder - Loading Tribble index from disk for file all_bis_mark_dup.indel.rc.bam.raw.vcf INFO 14:30:24,689 RMDTrackBuilder - Loading Tribble index from disk for file /users04/ipiras/ref/Mills_and_1000G_gold_standard.indels.b37.vcf WARN 14:30:24,793 VCFStandardHeaderLines$Standards - Repairing standard header line for field GQ because -- type disagree; header has Float but standard is Integer INFO 14:30:24,808 GenomeAnalysisEngine - Processing 115169878 bp from intervals INFO 14:30:24,824 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 14:30:24,824 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining WARN 14:30:24,888 Utils - ******************************************************************************** WARN 14:30:24,889 Utils - * WARNING: WARN 14:30:24,889 Utils - * WARN 14:30:24,889 Utils - * Rscript not found in environment path. WARN 14:30:24,889 Utils - * /huentelman/057_KIBRA_RD_Nextera_6_24_11/FASTQs/Project_KIBRA_RD_Nexte WARN 14:30:24,889 Utils - * a/final_output/output.plots_INDEL.R will be generated but PDF plots WARN 14:30:24,890 Utils - * will not. WARN 14:30:24,890 Utils - ******************************************************************************** INFO 14:30:24,894 TrainingSet - Found mills track: Known = true Training = true Truth = true Prior = Q12.0 INFO 14:30:28,472 VariantDataManager - QD: mean = NaN standard deviation = NaN INFO 14:30:29,705 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.2-16-g9f648cb):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator
    ERROR ------------------------------------------------------------------------------------------
  • Geraldine_VdAuwera (Administrator, GSA Member)

    The error you're reporting here has nothing to do with value ranges; it seems you're missing annotations. See the error message for more information.

    Geraldine Van der Auwera, PhD

  • TechnicalVault (Sanger, Cambridge, UK; Member)

    Given that in 2.4 the Haplotype Score annotation no longer applies to indels. Could we get an updated recommendation for the INDEL command line please? Thank you

    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute

  • Geraldine_VdAuwera (Administrator, GSA Member)

    For now, just remove the HS annotation from the command line; the rest still applies in the same way.

    Geraldine Van der Auwera, PhD

  • barak (Member)

    Hi, I'm analyzing a single sample of high coverage whole genome sequencing data (Illumina at ~60x). Do you have any recommendations for variant recalibration here? Is it appropriate for this scenario? I did variant calling on this single sample with the UnifiedGenotyper (HaplotypeCaller is still running ...)

    thanks Barak

  • bpow (Member)

    I could be missing something, but the documentation above refers to hapmap_3.3.b37.sites.vcf and 1000G_omni2.5.b37.sites.vcf and suggests:

    Using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

    I think the files in the GATK resource bundle hapmap_3.3.b37.vcf.gz and 1000G_omni2.5.b37.vcf.gz are sites-only files, but not named as such. If so, the documentation would be clearer if it used the same file names as the resource bundle.

  • Geraldine_VdAuwera (Administrator, GSA Member)

    That's a fair point, we'll try to homogenize the names between the docs and the bundle files. Thanks for pointing this out.

    Geraldine Van der Auwera, PhD

  • rpoplin (GSA Member, mod)

    Hi all, this page was updated in April 2013 to reflect the results of many experiments run at the Broad to provide the best possible recommendations to the community for how best to use the VQSR in conjunction with the current version of the UnifiedGenotyper and HaplotypeCaller.

  • vipingjo (Member)

    Hi, the file mentioned above '1000G_phase1.snps.high_confidence.vcf' was not found at ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.3/b37. Where can I find it?

  • egafni (Member)

    I am also unable to locate the 1000G phase1 high_confidence file

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Hi folks,

    The file is not available yet, sorry. This is due to a slight timing issue between documentation updates and the release process -- the new file will be in the new version of the bundle which will be released with version 2.5 within a few days.

    Geraldine Van der Auwera, PhD

  • redzengenoist (Member)

    Hi Geraldine,

    I also get NaN LOD:

    ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)

    But I'm testing on 1 sample which is WGS at 4x; is it likely that the set just isn't big enough?

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Hi @redzengenoist,

    Just one sample is a little short, I'm afraid. You should consider supplementing it with some other callset as described in the Best Practices recommendations; we typically use calls from the 1000 Genomes Project.

    Geraldine Van der Auwera, PhD

  • bandersoCHG (Member)

    Hello,

    I am working with the newly suggested 1000G_phase1.snps.high_confidence.vcf variants to train the error model for my whole exome sequencing experiment. However, when I attempt to run VariantRecalibrator I get the following error:

    INFO 09:43:24,634 HelpFormatter - Date/Time: 2013/05/09 09:43:24 INFO 09:43:24,634 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:43:24,634 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:43:24,647 ArgumentTypeDescriptor - Dynamically determined type of snpsUG_NTDplusLGMD_allsamples.raw.vcf to be VCF INFO 09:43:24,650 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/hapmap_3.3.b37.sites.vcf to be VCF INFO 09:43:24,651 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/1000G_omni2.5.b37.sites.vcf to be VCF INFO 09:43:24,652 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf to be VCF INFO 09:43:24,654 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/dbsnp_137.b37.vcf to be VCF INFO 09:43:25,355 GenomeAnalysisEngine - Strictness is SILENT INFO 09:43:25,463 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 09:43:25,490 RMDTrackBuilder - Loading Tribble index from disk for file snpsUG_NTDplusLGMD_allsamples.raw.vcf INFO 09:43:25,667 RMDTrackBuilder - Loading Tribble index from disk for file /chg/mitosis1/ngs_references/hapmap_3.3.b37.sites.vcf INFO 09:43:25,685 RMDTrackBuilder - Loading Tribble index from disk for file /chg/mitosis1/ngs_references/1000G_omni2.5.b37.sites.vcf INFO 09:43:25,724 RMDTrackBuilder - Creating Tribble index in memory for file /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf INFO 09:47:23,671 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Input files /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf and reference have incompatible contigs: No overlapping contigs found.
    ERROR /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX]
    ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]
    ERROR ------------------------------------------------------------------------------------------

    So, is there an inconsistency between the high_confidence variant list and the reference? I am using human_g1k_v37.fasta.

    Thanks, any help would be greatly appreciated.

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Hi @bandersoCHG,

    You are using the wrong reference -- you must use the same reference as your reads were aligned against. Based on the contig names listed in the error message (and the file name of the variants) you should be using the hg19 reference. You can get it from our resource bundle if you don't have it already.

    Geraldine Van der Auwera, PhD

  • kyasuno (Member)

    Hi, it seems that the documentation has been changed dramatically (e.g. the exome-specific recommendations have disappeared, and it apparently requires applying VQSR separately to SNPs and INDELs, even for HaplotypeCaller calls [how can we handle other types of variants detected by HC?]).
    Any help interpreting these changes would be greatly appreciated. Thanks

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Hi @kyasuno,

    The documentation was updated after we re-evaluated the results we were getting with the latest versions of the tools. The changes are too sweeping for us to be able to provide a blow-by-blow explanation of why these changes were made. Ultimately these should simply be taken as our best recommendations for using the tools in their current state. We'd be happy to answer any specific questions you may have about specific points of course.

    Geraldine Van der Auwera, PhD

  • kyasuno (Member)

    Hi Geraldine,

    Sorry for my vague comments.

    I'm most interested in how MNPs and block substitutions will be handled if we need to apply VQSR to SNPs and INDELs separately for HaplotypeCaller's call set. In the previous version of the recommendations, all the variant types were supposed to be analyzed together, so I assumed that complex variants like MNPs and block substitutions were analyzed properly.

    Secondly, it would be great to learn some of the rationale behind adding the "-minNumBad 1000" condition. Does it mean that training the bad-variant model on the 1000 worst-scoring sites is sufficient, and that the default value of 2500 is not sensible (or is redundant)? When a specified value of percentBadVariants results in, say, 1500 sites (more than the 1000 of minNumBad), does VQSR use 1500 sites for training?

    Thanks

  • rpoplin (GSA Member, mod)

    Hi there,

    1.) The names of the modes are a little ambiguous. The SNP mode handles SNP variants while the INDEL mode does everything else, including MNPs and complex variants in its model. In our tests we achieve better results by breaking out the SNPs from the rest of the variants and modeling them separately.

    2.) The number of negative training examples is the maximum of the flat number specified by -minNumBad and the number calculated with -percentBadVariants. The value was lowered in the updated best practice recommendations because our testing revealed that far fewer false positive raw calls are being made in the newer pipeline.
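
    To make that second point concrete with a worked example based on the rule above: the number of negative training sites used is max(minNumBad, percentBadVariants × total variants), so with -minNumBad 1000 and a percentBadVariants setting that works out to 1500 sites on a given callset, the larger value (1500) is what gets used for training.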

    Cheers,

  • Kurt (Member)

    Howdy,

    I've been familiarizing myself with the new recommendations for VQSR. In the past, for exome experiments using the UnifiedGenotyper, one of the recommendations was to lower the emit/call confidence in UG from 30 to 20 to increase the number of variants available to the model. Judging by these recent VQSR recommendations, I am inferring that you no longer feel this is necessary when using the UG+VQSR workflow for exomes. I did some digging around on here to see if I could find some documentation that still referenced it (I think it used to be in the best practices workflow), but didn't come across it anywhere now. Just curious.

    Thanks,

    Kurt

  • Carneiro (Administrator, GSA Member)

    If you're doing human exomes with decent quality sequencing, you should be okay with the default emit/call confidences.

  • TechnicalVault (Sanger, Cambridge, UK; Member)
    edited June 2013

    Can I appeal to you to start version/revision numbering these recommendations, please?


    Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Yes we're going to try to set that up.

    Geraldine Van der Auwera, PhD

  • blueskypy (Member)
    edited June 2013

    @ebanks said: The recommendations say "Add additional samples for variant calling" not for recalibration. In other words you need to do the calling simultaneously with your own samples plus the 1000 Genomes bams. You cannot just merge together VCFs.

    So I should only choose the g1k samples that are similar to my samples, right? For example, if my samples are Chinese, I should choose only Chinese samples from g1k.

  • ebanks (GSA Member, mod)

    Exactly, as close as you can get them.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • morgantaschuk (Member)

    I'm trying to follow the recommendations for recalibrating indels and having problems with the input sets.

    Here is my command line:

    java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19_random.fa \
        -mode INDEL -tranche 99.9 -percentBad 0.01 -minNumBad 1000 \
        -input data/merged_raw.vcf -recalFile data/indel.recal -tranchesFile data/indel.tranches \
        -resource:mills,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg19.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.indels.hg19.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.hg19.vcf \
        -an DP -an FS -an MQRankSum -an ReadPosRankSum \
        -et NO_ET -K GATK.key --maxGaussians 4 -nt 8

    And the error: Values for DP annotation not detected for ANY training variant in the input callset

    Following recommendations here, I checked my input dataset, and there are DP annotations on 13/251 lines. However, neither Mills_and_1000G_gold_standard.indels.hg19.vcf nor 1000G_phase1.indels.hg19.vcf have any DP annotations. DBsnp has DP annotations. The training sets are direct from the resource bundle. Do you have any tips on how to proceed?

  • Geraldine_VdAuwera (Administrator, GSA Member)

    The error suggests that those sites that were selected for training (ie sites in your callset that overlap with the resources) do not have the annotation.

    More importantly -- do I read correctly that you have only 251 sites in your callset? If so that is a very small set, and it's very likely that you won't have enough data for recalibration anyway.

    Geraldine Van der Auwera, PhD

  • morgantaschuk (Member)

    Thanks for the swift reply. Yes, there are only 251, because I'm using a cut-down dataset while I prototype my method. I just tried to run VariantAnnotator over this dataset and I didn't get any more annotations than HaplotypeCaller put on already. I guess I will try to run it on the full dataset and see if I continue to run into this problem.

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Ah, that makes sense. While that works for most steps, VQSR just doesn't run properly on small prototyping sets.

    Let me know if you have any issues running on the full dataset.

    Geraldine Van der Auwera, PhD

  • avidLearner (Member)

    Hello, I haven't had any luck in locating the 1000 Genomes Phase 1 SNPs VCF in the resource bundle at ftp://ftp.broadinstitute.org/bundle/2.3/. I only see the 1000G Phase 1 indels file. Which resource file can I use for the VariantRecalibrator step?

  • Geraldine_VdAuwera (Administrator, GSA Member)

    You're looking in the wrong directory of the bundle; you need to go one level up and find the most recent version number.

    Geraldine Van der Auwera, PhD

  • mimi_lupton (Member)

    Dear Geraldine, I have a dataset that is a custom capture of 500 kb. I used both VariantRecalibration and hard filtering on my data to compare the results. The VariantRecalibration appears much less stringent, only filtering 9 SNPs out of 5978, as opposed to hard filtering which filtered 1433. This dataset has 932 individuals. Is it appropriate to use VariantRecalibration on a dataset of 500 kb?

    Many thanks

  • Geraldine_VdAuwera (Administrator, GSA Member)

    @mimi_lupton, sorry for skipping you there. What settings did you use for VariantRecalibration, eg what annotations, tranche etc? And did you get any warning from the recalibrator about the number of variants in the model?

    Geraldine Van der Auwera, PhD

  • mimi_lupton (Member)

    Thanks for your reply, here is my script;

    /usr/java/latest/bin/java -Xmx8g -jar /share/apps/gatk_current/GenomeAnalysisTK.jar -T VariantRecalibrator \
        -R human_g1k_v37.fasta \
        -input input.vcf \
        -recalFile allsofar050713.recal \
        -tranchesFile allsofar050713.tranches \
        -percentBad 0.01 -minNumBad 1000 \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 /1000G_phase1.snps.high_confidence.b37.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.b37.vcf \
        -an QD -an MQRankSum -an ReadPosRankSum -an FS -an InbreedingCoeff \
        -mode SNP

    I checked my output and I have got a warning; "VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable".

    I have pasted the full output below, and also attached the tranches PDF, I'm not really sure how to interpret it.

    Thanks for your help!

    INFO 03:13:31,159 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 03:13:31,285 TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0 INFO 03:13:31,286 TrainingSet - Found omni track: Known = false Training = true Truth = true Prior = Q12.0 INFO 03:13:31,286 TrainingSet - Found 1000G track: Known = false Training = true Truth = false Prior = Q10.0 INFO 03:13:31,286 TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q2.0 INFO 03:13:31,497 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/jump2grouppowell/Complement/UnifiedGenotyper/allsofar050713.vcf INFO 03:13:31,519 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/jump2grouppowell/Complement/UnifiedGenotyper/allsofar050713.vcf INFO 03:13:31,539 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/jump2grouppowell/Complement/UnifiedGenotyper/allsofar050713.vcf INFO 03:13:31,549 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/hapmap_3.3.b37.vcf INFO 03:13:31,607 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/hapmap_3.3.b37.vcf INFO 03:13:31,644 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/hapmap_3.3.b37.vcf INFO 03:13:31,689 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/1000G_omni2.5.b37.vcf INFO 03:13:31,710 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/1000G_omni2.5.b37.vcf INFO 03:13:31,749 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/1000G_omni2.5.b37.vcf INFO 03:13:31,775 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/1000G_phase1.snps.high_confidence.b37.vcf INFO 03:13:31,896 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/1000G_phase1.snps.high_confidence.b37.vcf INFO 03:13:32,032 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/1000G_phase1.snps.high_confidence.b37.vcf INFO 03:13:32,149 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/dbsnp_137.b37.vcf INFO 03:13:32,323 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/dbsnp_137.b37.vcf INFO 03:13:32,450 RMDTrackBuilder - Loading Tribble index from disk for file /home/mluptonbrc/bin/GATK/dbsnp_137.b37.vcf INFO 03:14:01,173 ProgressMeter - 2:202604066 8.13e+06 30.0 s 3.0 s 14.6% 3.4 m 2.9 m INFO 03:14:31,182 ProgressMeter - 5:29222788 1.72e+07 60.0 s 3.0 s 29.4% 3.4 m 2.4 m INFO 03:15:01,191 ProgressMeter - 8:4238499 2.68e+07 90.0 s 3.0 s 45.0% 3.3 m 109.0 s INFO 03:15:31,201 ProgressMeter - 11:125274161 3.72e+07 120.0 s 3.0 s 62.6% 3.2 m 71.0 s INFO 03:16:01,210 ProgressMeter - 17:42649031 4.79e+07 2.5 m 3.0 s 82.0% 3.0 m 32.0 s INFO 03:16:26,705 VariantDataManager - QD: mean = 17.38 standard deviation = 6.30 INFO 03:16:26,709 VariantDataManager - MQRankSum: mean = -8.49 standard deviation = 6.68 INFO 03:16:26,714 VariantDataManager - ReadPosRankSum: mean = -0.78 standard deviation = 7.92 INFO 03:16:26,717 VariantDataManager - FS: mean = 31.30 standard deviation = 145.47 INFO 03:16:26,719 VariantDataManager - InbreedingCoeff: mean = 0.04 standard deviation = 0.08 INFO 03:16:26,726 VariantDataManager - Training with 2144 variants after standard deviation thresholding. 
INFO 03:16:26,732 GaussianMixtureModel - Initializing model with 30 k-means iterations...
INFO 03:16:27,023 VariantRecalibratorEngine - Finished iteration 0.
INFO 03:16:27,141 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.50251
INFO 03:16:27,201 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.25074
INFO 03:16:27,259 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.37053
INFO 03:16:27,318 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.08056
INFO 03:16:27,377 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.14237
INFO 03:16:27,423 VariantRecalibratorEngine - Convergence after 29 iterations!
INFO 03:16:27,456 VariantRecalibratorEngine - Evaluating full set of 5468 variants...
INFO 03:16:27,484 VariantDataManager - Found 0 variants overlapping bad sites training tracks.
WARN 03:16:27,490 VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable.
INFO 03:16:27,491 VariantDataManager - Additionally training with worst 18.288% of passing data --> 1000 variants with LOD <= -2.3706.
INFO 03:16:27,491 GaussianMixtureModel - Initializing model with 30 k-means iterations...
INFO 03:16:27,507 VariantRecalibratorEngine - Finished iteration 0.
INFO 03:16:27,536 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.77855
INFO 03:16:27,564 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.15900
INFO 03:16:27,592 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.09625
INFO 03:16:27,621 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.05664
INFO 03:16:27,649 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.03025
INFO 03:16:27,660 VariantRecalibratorEngine - Convergence after 27 iterations!
INFO 03:16:27,663 VariantRecalibratorEngine - Evaluating full set of 5468 variants...
INFO 03:16:27,663 VariantRecalibrator - Negative model failed to converge. Retrying...
INFO 03:16:27,663 GaussianMixtureModel - Initializing model with 30 k-means iterations...
INFO 03:16:27,676 VariantRecalibratorEngine - Finished iteration 0.
INFO 03:16:27,701 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.27127
INFO 03:16:27,728 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.19246
INFO 03:16:27,753 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.08984
INFO 03:16:27,779 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.05495
INFO 03:16:27,805 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.03387
INFO 03:16:27,830 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.03015
INFO 03:16:27,846 VariantRecalibratorEngine - Convergence after 33 iterations!
INFO 03:16:27,847 VariantRecalibratorEngine - Evaluating full set of 5468 variants...
INFO 03:16:27,866 VariantRecalibratorEngine - Evaluating full set of 5468 variants...
INFO 03:16:27,866 VariantRecalibrator - Negative model failed to converge. Retrying...
INFO 03:16:27,866 GaussianMixtureModel - Initializing model with 30 k-means iterations...
INFO 03:16:27,877 VariantRecalibratorEngine - Finished iteration 0.
INFO 03:16:27,900 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.28242
INFO 03:16:27,923 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.15384
INFO 03:16:27,945 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.06462
INFO 03:16:27,967 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.06502
INFO 03:16:27,990 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.05017
INFO 03:16:28,012 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.05920
INFO 03:16:28,035 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.07426
INFO 03:16:28,057 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.08184
INFO 03:16:28,080 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.06037
INFO 03:16:28,102 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.06391
INFO 03:16:28,125 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.04143
INFO 03:16:28,147 VariantRecalibratorEngine - Finished iteration 60. Current change in mixture coefficients = 0.03534
INFO 03:16:28,170 VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.03109
INFO 03:16:28,192 VariantRecalibratorEngine - Finished iteration 70. Current change in mixture coefficients = 0.02960
INFO 03:16:28,214 VariantRecalibratorEngine - Finished iteration 75. Current change in mixture coefficients = 0.01891
INFO 03:16:28,214 VariantRecalibratorEngine - Convergence after 75 iterations!
INFO 03:16:28,216 VariantRecalibratorEngine - Evaluating full set of 5468 variants...
INFO 03:16:28,233 VariantRecalibratorEngine - Evaluating full set of 5468 variants...
INFO 03:16:28,233 VariantRecalibrator - Negative model failed to converge. Retrying...
INFO 03:16:28,234 GaussianMixtureModel - Initializing model with 30 k-means iterations...
INFO 03:16:28,244 VariantRecalibratorEngine - Finished iteration 0.
INFO 03:16:28,263 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.34288
INFO 03:16:28,283 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.23853
INFO 03:16:28,302 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.16176
INFO 03:16:28,321 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.06558
INFO 03:16:28,333 VariantRecalibratorEngine - Convergence after 23 iterations!
INFO 03:16:28,334 VariantRecalibratorEngine - Evaluating full set of 5468 variants...
INFO 03:16:28,351 VariantRecalibratorEngine - Evaluating full set of 5468 variants...
INFO 03:16:28,524 TrancheManager - Finding 4 tranches for 5468 variants
INFO 03:16:28,526 TrancheManager - Tranche threshold 100.00 => selection metric threshold 0.000
INFO 03:16:28,528 TrancheManager - Found tranche for 100.000: 0.000 threshold starting with variant 0; running score is 0.000
INFO 03:16:28,528 TrancheManager - Tranche is Tranche ts=100.00 minVQSLod=-Infinity known=(3034 @ 2.4484) novel=(2434 @ 1.9682) truthSites(990 accessible, 990 called), name=anonymous]
INFO 03:16:28,529 TrancheManager - Tranche threshold 99.90 => selection metric threshold 0.001
INFO 03:16:28,531 TrancheManager - Found tranche for 99.900: 0.001 threshold starting with variant 9; running score is 0.001
INFO 03:16:28,531 TrancheManager - Tranche is Tranche ts=99.90 minVQSLod=-919688539663927810.0000 known=(3028 @ 2.4494) novel=(2431 @ 1.9718) truthSites(990 accessible, 989 called), name=anonymous]
INFO 03:16:28,531 TrancheManager - Tranche threshold 99.00 => selection metric threshold 0.010
INFO 03:16:28,533 TrancheManager - Found tranche for 99.000: 0.010 threshold starting with variant 121; running score is 0.010
INFO 03:16:28,533 TrancheManager - Tranche is Tranche ts=99.00 minVQSLod=-42429540260765608.0000 known=(2987 @ 2.4660) novel=(2360 @ 2.0974) truthSites(990 accessible, 980 called), name=anonymous]
INFO 03:16:28,533 TrancheManager - Tranche threshold 90.00 => selection metric threshold 0.100
INFO 03:16:28,534 TrancheManager - Found tranche for 90.000: 0.100 threshold starting with variant 525; running score is 0.100
INFO 03:16:28,534 TrancheManager - Tranche is Tranche ts=90.00 minVQSLod=-6148189719669475.0000 known=(2734 @ 2.4750) novel=(2209 @ 2.1174) truthSites(990 accessible, 891 called), name=anonymous]
INFO 03:16:28,536 VariantRecalibrator - Writing out recalibration table...
INFO 03:16:28,881 VariantRecalibrator - Executing: Rscript (resource)org/broadinstitute/sting/gatk/walkers/variantrecalibration/plot_Tranches.R /home/mluptonbrc/jump2grouppowell/Complement/UnifiedGenotyper/allsofar050713.tranches 2.15
INFO 03:16:31,219 ProgressMeter - GL000202.1:10465 5.66e+07 3.0 m 3.0 s 99.8% 3.0 m 0.0 s
INFO 03:16:35,724 ProgressMeter - done 5.66e+07 3.1 m 3.0 s 99.8% 3.1 m 0.0 s
INFO 03:16:35,725 ProgressMeter - Total runtime 184.57 secs, 3.08 min, 0.05 hours
INFO 03:16:38,261 GATKRunReport - Uploaded run statistics report to AWS S3

    [Attachment: allsofar050713.tranches.pdf, 8K]
  • morgantaschukmorgantaschuk Posts: 16Member

    Long story short, VQSR is still dying with an insufficient-annotation error on a full-sized dataset. Note that it does work on some data, but not on this one, for some reason.

    java -Xmx14000M -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hg19_random.fa \
        -mode SNP -tranche 99.9 -percentBad 0.01 -minNumBad 1000 \
        -input data/merged_raw.vcf -recalFile data/snp.recal -tranchesFile data/snp.tranches \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg19.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.hg19.vcf \
        -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
        -et NO_ET -K GATK.key \
        -nt 8
    ...
    INFO  11:59:50,521 ProgressMeter -   chr2:90166772        6.08e+06   30.0 s        5.0 s     10.8%         4.6 m     4.1 m
    INFO  12:00:20,523 ProgressMeter -  chr4:155311684        1.58e+07   60.0 s        3.0 s     27.0%         3.7 m     2.7 m
    INFO  12:00:50,525 ProgressMeter -  chr7:117044484        2.57e+07   90.0 s        3.0 s     43.1%         3.5 m   119.0 s
    INFO  12:01:20,528 ProgressMeter -  chr11:39300768        3.55e+07  120.0 s        3.0 s     59.1%         3.4 m    82.0 s
    INFO  12:01:50,530 ProgressMeter -  chr15:98024379        4.51e+07    2.5 m        3.0 s     76.7%         3.3 m    45.0 s
    INFO  12:02:20,532 ProgressMeter -   chrX:80525476        5.53e+07    3.0 m        3.0 s     94.4%         3.2 m    10.0 s
    INFO  12:02:24,216 VariantDataManager - DP:      mean = NaN      standard deviation = NaN
    ...
    ##### ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator
    

    I can confirm that this is a full-sized whole genome with 3,988,356 variants. Every variant has a DP annotation, and the callset has thousands of variants overlapping each of the training sets (HapMap 3.3, 1000G Omni 2.5, 1000G Phase 1 SNPs and dbSNP 137), as calculated with vcftools.

    I'm slightly concerned that it only took 3 minutes to get through the 4 million variants but I don't know if that's unusual. I also tried it with a lower tranche (90.0) and it just failed slightly faster.

    Do you have any suggestions for what I can do? Thanks again in advance.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Hi @mimi_lupton, the fact that you're getting the "very few variants" warning means that the results may not be reliable, so you need to be careful in the evaluation. Since you're saying only 9 variants got filtered out it sounds like you chose the highest sensitivity tranche for the "ApplyRecalibration" step. If you choose a lower sensitivity tranche you can improve the specificity. The second graph in the PDF is perhaps the most helpful because it shows the tradeoff between sensitivity and specificity by tranche. So, I would recommend you try choosing a different tranche and compare which variants get filtered then with those that get filtered with the hard filters. Does that make sense?
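
    In practice, trying a different tranche just means re-running ApplyRecalibration with a different --ts_filter_level; a minimal sketch, with file names assumed rather than taken from the thread:

    java -Xmx4g -jar GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R reference.fasta \
        -input raw_variants.vcf \
        -recalFile snp.recal \
        -tranchesFile snp.tranches \
        --ts_filter_level 99.0 \
        -mode SNP \
        -o recalibrated_snps.vcf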

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    @morgantaschuk, I've rescued your posts from the spam filter and confirmed you as a legitimate user, so you should no longer experience any such issues.

    I'm slightly concerned that it only took 3 minutes to whip through those 4 million variants in my input dataset. Am I specifying something wrong?

    Well, this is new, a complaint that the GATK is running too fast :-)

    Every variant has DP annotation

    Can you confirm that they have the FORMAT field DP annotation, not just the INFO one?
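
    A quick way to see which field actually carries DP (a hedged shell one-liner; merged_raw.vcf is the input file from the command above):

    grep -v '^#' merged_raw.vcf | head -n 1 | awk -F'\t' '{print "INFO:   " $8; print "FORMAT: " $9}'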

    Geraldine Van der Auwera, PhD

  • morgantaschukmorgantaschuk Posts: 16Member

    Oh, I see, so it needs to be in the genotype columns? You are correct, the DP annotation is only in the INFO column.

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  PANC_R
    chrX    259221  .       GA    G       136.74  .       AC=2;AF=1.00;AN=2;DP=15;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=8.82;MQ0=1;QD=3.04       GT:AD:GQ:PL     1/1:0,2:6:164,6,0
    

    I will run the Annotator on it and see if that helps.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Sorry, the recalibrator actually does use the INFO-field DP, my mistake.

    As @pdexheimer pointed out in your other thread, this could be an issue with your VCF being malformed. How was the VCF generated? If by a GATK tool, can you tell me which version?

    Geraldine Van der Auwera, PhD

  • mimi_luptonmimi_lupton Posts: 8Member

    @Geraldine_VdAuwera said: Hi mimi_lupton, the fact that you're getting the "very few variants" warning means that the results may not be reliable, so you need to be careful in the evaluation. Since you're saying only 9 variants got filtered out it sounds like you chose the highest sensitivity tranche for the "ApplyRecalibration" step. If you choose a lower sensitivity tranche you can improve the specificity. The second graph in the PDF is perhaps the most helpful because it shows the tradeoff between sensitivity and specificity by tranche. So, I would recommend you try choosing a different tranche and compare which variants get filtered then with those that get filtered with the hard filters. Does that make sense?

    Hi Geraldine, yes. I have tried using both ts_filter_level 95 and 90. It filters out 121 variants at level 95 and 525 variants at level 90, out of 5978 SNPs. If I compare this to the hard filtering, which removed 1433 SNPs, 8% of the level-95 SNPs overlap with those, and 22% of the level-90 SNPs overlap with those.

    What would you recommend? Should I try some other levels, or is it safer to stick with the hard filters, since VQSR is not reliable with this number of SNPs?

    Many thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    It's difficult to say, it really depends on the dataset. Ideally you should have a look at subsets of variants in the various filtered sets, and get a sense of the proportions of variants that look legit vs. artifactual. That should tell you more about the error modes in your data and inform your decision about which filters are most appropriate.
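
    One hedged way to do that kind of comparison from the command line (file names hypothetical) is to pull out the sites that fail each filtering approach and count the overlap:

    grep -v '^#' vqsr_ts90.vcf | awk -F'\t' '$7 != "PASS" && $7 != "."' | cut -f1,2 | sort > vqsr_failed.txt
    grep -v '^#' hard_filtered.vcf | awk -F'\t' '$7 != "PASS" && $7 != "."' | cut -f1,2 | sort > hard_failed.txt
    comm -12 vqsr_failed.txt hard_failed.txt | wc -l    # sites failed by both approaches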

    Geraldine Van der Auwera, PhD

  • morgantaschukmorgantaschuk Posts: 16Member

    I got it to run!! I'm going to go through my method in case anyone else runs into the same series of problems.

    Thanks to @pdexheimer and @Geraldine_VdAuwera, I took a closer look at my input file. With 4 million variants, it was hard to tell if something in the middle was badly formatted, so I separated it out by chromosome and took a stab at chromosome 2.

    The first error on this chromosome that I ran into was "ERROR MESSAGE: Line 15139: there aren't enough columns for line (we expected 9 tokens, and saw 1 )". Every time I ran it, I saw a different line number. I've run into this problem before when I pass GATK insufficient memory, but the line number didn't change significantly between 8G and 16G RAM so I decided that wasn't the problem. I deleted the Tribble index on the off-chance that was the problem.

    In re-generating the index, I came across a different error: "ERROR MESSAGE: Input file must have contiguous chromosomes. Saw feature chr2:11679-11679 followed later by chr2:114343912-114343912 and then chr2:114343971-114343971, for input source: chr2.vcf"

    I went looking for these variants and encountered something strange:

    ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@chr2      114343912       .       C       G       1217.77 .       AC=1;AF=0.500;AN=2;BaseQRankSum=2.928;ClippingRankSum=-2.356;DP=70;FS=4.232;MLEAC=1;MLEAF=0.500;MQ=29.36;MQ0=0;MQRan
    

    For some reason, some of our split-by-chromosome files had some binary characters in them. When we merged them back together and sorted them, the mimetype and charset changed from "text/plain; us-ascii" to "text/plain; binary", and the resulting file had incorrect formatting because of these binary characters, which made VQSR explode a little inside.

    Through much trial and error, we determined that we should remove the binary characters with the following Unix command line before the charset gets messed up.

    tr -cd '\11\12\15\40-\176' < chr2.vcf > chr2s.vcf
    

    It runs quickly, so we might just pass all of our files through it in the future to make sure this kind of thing doesn't happen again! To answer any other questions, we used GATK 2.4-9 HaplotypeCaller parallelized over chromosomes. The parallelization happens at the BAM file stage. The actual command that converts the charset is GNU sort, which I guess encounters the binary characters and "helpfully" converts the final file's charset.
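
    A couple of hedged sanity checks are worth running after a cleanup like that (the -a flag tells GNU grep to treat binary-looking files as text):

    grep -avc '^#' chr2.vcf               # record count before stripping NULs
    grep -avc '^#' chr2s.vcf              # record count after; should be identical
    LC_ALL=C grep -acP '\x00' chr2s.vcf   # lines still containing NUL bytes; expect 0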

    Thanks again for everyone's help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Thanks for reporting your solution, @morgantaschuk! This will surely be helpful for other users.

    Geraldine Van der Auwera, PhD

  • neilwneilw Posts: 21Member

    Hey there, I just wanted to point out that you have "truth=true" for omni in the SNP recommendations above. Was that intentional? It seems to conflict with what's listed in the documentation for VariantRecalibrator and elsewhere.

    Also, as a side note. With regard to the previous two posts:

    If your VCF has a bunch of NUL characters (ASCII 0, which print as ^@) in it, then it seems rather likely that something has gone horribly, horribly wrong. I've most often seen this when I've restarted an experiment and accidentally left the original one running, so that two processes were writing to the same file at the same time. Using "tr" to remove the most obvious symptom of the problem (the NULs) doesn't fix the underlying problem.

    I'm sure that in this case, @morgantaschuk has inspected his or her data to ensure that all of the expected lines are there, but certainly this can't be touted as a general solution to a common problem. If your data has trash in the middle of it, you really ought to figure out why -- not just use 'tr' to remove the trash.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Fair point, @neilw. We do generally assume that users will do their due diligence in inspecting their files and be careful about manipulating them with what amounts to fairly blunt instruments... but you are right to sound a note of caution explicitly about the risks involved. I do hope that anyone reading this thread will take your comment into careful consideration.

    Regarding the Omni sites, this document should be considered the ultimate correct reference in case of inconsistencies with any other docs. Values given in all other docs should be considered to be examples only. I'll try to clarify that in the other docs.

    Geraldine Van der Auwera, PhD

  • neilwneilw Posts: 21Member

    Thanks, @Geraldine_VdAuwera -- I don't think that you need to clarify that elsewhere. It's always been clear that this was the definitive resource, but I was concerned that this was a potential typo because (as far as I know) it changed here, too. I'll happily adopt what's here.

  • morgantaschukmorgantaschuk Posts: 16Member

    @neilw, @Geraldine_VdAuwera: We have since discovered that we were having some hardware and network problems at the time of my last post that most likely affected my output, so that was probably the cause of the mysterious zeroes. Indeed, applying the 'tr' command universally is not the correct solution. Facing a week of re-analysis to re-generate the file (which was just a testing set anyway), I chose this slightly hackish solution to complete my analysis. Thanks again for your help. It was very much appreciated!

  • OprahOprah Posts: 21Member

    Please update the recommendations for those using v2.7: -percentBad is deprecated. Is -numBad 1000 recommended for both exome and whole-genome experiments? Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Hi @Oprah,

    Thanks for pointing this out, we will correct it. FYI we are working on an updated version of the recommendations to reflect the changes in the latest version (largely related to the negative training model). We'll try to have that out within a day or two tops.

    Geraldine Van der Auwera, PhD

  • OprahOprah Posts: 21Member

    Will there be an update this week?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Hi @Oprah,

    I have fixed the recommendations to reflect the argument update. For now we're keeping the same recommended default values, so you can proceed with these if that's what you're asking.

    In the near future (next few weeks) we plan to update the documentation with additional details on how to evaluate the results you get from VQSR and guidelines on how to tweak the values of various arguments to suit your dataset.

    Geraldine Van der Auwera, PhD

  • OprahOprah Posts: 21Member

    I have a vcf file with ~19 million variants (~100 whole genomes, human). Under mode SNP, I got an error message with a suggestion to increase numBad to 3000, which worked fine. Then I tried 6000, 60000, and 1200000, and I liked the slight Ti/Tv improvement shown on the plot. Should I use the same number under mode INDEL?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    With that amount of data you can definitely afford to use larger numbers than the defaults for numBad, sure. You might not want to go to quite as high numbers with indels as with SNPs, but it's worth experimenting a little to find the numbers that work best for your data.

    Geraldine Van der Auwera, PhD

  • mlindermmlinderm Posts: 18Member

    I am updating my VQSR configuration and noticed that the suggested annotations had changed. MQ is gone from SNPs, QD is gone from indels, and MQRankSum seems to have been added for indels. However, the following paragraph seems to contradict the introduction of MQRankSum. Is this just a case of the documentation being out of sync (and the listed annotations are indeed the recommendations), or did something else get mixed up in the editing? While the argument around MQ makes intuitive sense, I could imagine that it does not bear out in experiment.

    Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    IIRC MQRankSum is not subject to the same issues as MQ, but I will check with @rpoplin to make sure.

    Geraldine Van der Auwera, PhD

  • nreidnreid UC DavisPosts: 2Member

    I'm not sure if this is the right place to ask this... How large do the training/truth data sets have to be? I have a low pass dataset (6x coverage) of 96 killifish and no prior training/truth datasets. I do have some rad-seq data from an on-going mapping project that I think I could use to generate some very high confidence variants (by using parent-offspring relationships as confirmation). That would only generate at an absolute maximum a few tens of thousands of variants. Would that be enough to do VQSR? Do you have any recommendations? Thanks very much.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    It's hard to give an exact number but that sounds like a good start. Ultimately the number that's going to be important is how many variants in your cohort overlap with the training set. If a large proportion overlaps, then I think you should be okay. Only way to know for sure is to try it; in any case I expect it will be worthwhile for you to put together that set of known variants because it will come in handy at other steps of your analysis.

    Geraldine Van der Auwera, PhD

  • nikhil_joshinikhil_joshi Posts: 12Member
    edited October 2013

    So I have used the latest GATK best practices pipeline for variant detection on non-human organisms, but now I am trying to do it for human data. I downloaded the Broad bundle and I was able to run all of the steps up to and including ApplyRecalibration. However, now I am not exactly sure what to do. The VCF file that is generated contains these FILTER values:

    .

    PASS

    VQSRTrancheSNP99.90to100.00

    I am not sure what these mean. Does the "VQSRTrancheSNP99.90to100.00" filter mean that that SNP falls below the specified truth sensitivity level? Does "PASS" mean that it is above that level? Or is it vice versa? And what does "." mean?

    I'm also having some difficulty fully understanding how the VQSLOD is used.... and what does the "culprit" mean when the filter is "PASS"?

    A final question.... I've been using this command to actually create a file with only SNPs that PASSed the filter:

    java -Xmx2g -jar /share/apps/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R ~/broad_bundle/ucsc.hg19.fasta \
        --variant Pt1.40300.output.recal_and_filtered.snps.chr1.vcf \
        -o Pt1.40300.output.recal_and_filtered.passed.snps.chr1.vcf \
        -select 'vc.isNotFiltered()'

    Is this the correct way to get PASSed SNPs? Is there a better way? Any help you can give me would be highly appreciated. Thanks!

    • Nikhil Joshi
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Hi Nikhil,

    In brief: during the recalibration process, variants in your callset are given a score called VQSLOD. At the same time, variants in your training sets are also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.90), what happens is that the program looks at what is the VQSLOD value above which 99.9% of the variants in the training callset are included. It then takes that value of VQSLOD and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives. Does that make sense?

    For the rest, "culprit" is just the annotation for which that variant got the lowest ranking, and a dot in the FILTER field means that no filters were applied to that variant (e.g. indels in a file for which you have only recalibrated SNPs, not indels yet).

    Finally, if you want to get only passing SNPs out of the recalibration, you can use the --excludeFiltered flag.
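
    For reference, a minimal sketch of SelectVariants with that flag (paths shortened from the command quoted earlier in the thread):

    java -Xmx2g -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R ucsc.hg19.fasta \
        --variant Pt1.40300.output.recal_and_filtered.snps.chr1.vcf \
        --excludeFiltered \
        -o Pt1.40300.output.recal_and_filtered.passed.snps.chr1.vcf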

    Geraldine Van der Auwera, PhD

  • nikhil_joshinikhil_joshi Posts: 12Member

    Hi Geraldine,

    Thank you for your response. I apologize for the double posting; I thought I had posted in the wrong place the first time. So I understand your explanations, but I still have some results that I don't understand. You said that a dot in the FILTER field means that no filters were applied; however, what does it mean if the variant is a SNP? I kept only SNPs before doing the recalibration step, so all of those FILTER values are for SNPs.

    • Nik.
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Hi Nik, can you post the record line?

    Geraldine Van der Auwera, PhD

  • nikhil_joshinikhil_joshi Posts: 12Member

    Hi Geraldine,

    So I am looking at the lines, and they are indels. So it looks like where I went wrong is that I assumed when I gave the "-mode SNP" option to ApplyRecalibration, that it would only output SNPs. However, it seems that it outputs everything, but just doesn't do any filtering or recalibration for indels.... is that correct? So will the "--excludeFiltered" flag only output SNPs that passed and take out all indels and filtered SNPs?

    • Nik.
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Yes and no. Yes, it's normal that when you run in SNP mode, indels are copied over untouched, and vice-versa; no, --excludeFiltered will not take out the indels since no filters have been applied to them yet.
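
    If the goal is a file containing only the passing SNPs, with the untouched indels dropped as well, one option (a hedged sketch, not taken from the thread) is to combine --excludeFiltered with a type selection:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R ucsc.hg19.fasta \
        --variant recal_and_filtered.snps.vcf \
        -selectType SNP \
        --excludeFiltered \
        -o passing.snps.only.vcf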

    Have a look at the howto/tutorial article on VQSR, this is explained in more detail there.

    Geraldine Van der Auwera, PhD

  • NikTuzovNikTuzov Posts: 9Member

    Hi Geraldine,

    Could you please elaborate on how you determine the priors for the resource datasets. The reason I am asking is twofold:

    1) I have my own resource set (obtained from a study) whose quality is probably much higher than that of HapMap, but I have no clue how to specify a prior except that it should be higher than Q15;

    2) I noticed that in this posting you assign a Q2 prior to dbSNP, whereas here:

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantrecalibration_VariantRecalibrator.html

    dbSNP is used with a prior of Q6. What was the reason for adjustment and how did you determine it had to be 4 units as opposed to 3 or 6?

    Thanks, Nik

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Hi Nik,

    The priors are determined empirically to a large extent. They represent the confidence you have that the variants in your resource file are true; in other words it's the expected proportion of true variants in the resource. We use validation data from orthogonal sources to estimate these numbers.
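
    For readers wondering how those Q values map to proportions, the priors appear to follow standard Phred scaling; assuming that, the values mentioned in this exchange translate roughly as:

    P(true) = 1 - 10^(-Q/10)
    Q15  ->  1 - 10^(-1.5) ≈ 0.968   (~97% of resource sites expected to be true)
    Q6   ->  1 - 10^(-0.6) ≈ 0.749   (~75%)
    Q2   ->  1 - 10^(-0.2) ≈ 0.369   (~37%)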

    For dbsnp, we downgraded the confidence because the more we checked the validity of sites from dbsnp, the less confident we were that we could rely on them. Eventually we settled on Q2 because that is the value we generally use to signify that variants should not be used in any kind of calculation. It is a sort of symbolic value. So it's not really that we took away a certain number of units; rather, we went straight for the symbolic value once we had decided dbsnp variants were useless for modeling. dbSNP is only used to annotate known vs. novel status; its sites are not used at all in the model calculations.

    Note that the examples given in the tech doc are not representative of our latest Best Practices recommendations; they are only given to illustrate general usage of the tools.

    Geraldine Van der Auwera, PhD

  • NikTuzovNikTuzov Posts: 9Member

    While dbSNP is relatively unreliable, I am not sure why it was necessary to exclude it from the training. E.g. imagine two scenarios:

    1) A variant caller calls a variant that is found in dbSNP;

    2) Exactly the same as 1), only the variant is not found in dbSNP;

    It looks like I should be a lot more confident about the call in case 1), i.e. the information from dbSNP should be useful for training the model.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    But using dbSNP introduces a lot of noise due to false positives. Meanwhile, our other resources contain most if not all the high confidence variants that would be useful in dbSNP. Using them and excluding dbSNP allows us to use a purer signal to train the model.

    Geraldine Van der Auwera, PhD

  • tommycarstensentommycarstensen Posts: 49Member
    edited February 26

    @ik5 said: I'm trying to use VariantRecalibrator on an exome dataset. I reduced the bams and created the vcf files using UnifiedGenotyper. The SNP calling works great but the indel calling gives an error. I'm using the Mills_and_1000G_gold_standard.indels.b37.vcf that was recommended.

    But it gives me the following error " ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations."

    @cge said: Hi, I'm finding the same error with INDELs that ik5 found. With SNPs I don't have many problems. My vcf file has been generated using UnifiedGenotyper from 42 BAM files, from a targeted resequencing experiment of a single gene.

    @morgantaschuk said: And the error: Values for DP annotation not detected for ANY training variant in the input callset

    Following recommendations here, I checked my input dataset, and there are DP annotations on 13/251 lines. However, neither Mills_and_1000G_gold_standard.indels.hg19.vcf nor 1000G_phase1.indels.hg19.vcf have any DP annotations. DBsnp has DP annotations. The training sets are direct from the resource bundle. Do you have any tips on how to proceed?

    @ik5 @cge @morgantaschuk I had the same problem as you. I got the same error message as you: Bad input: Values for DP annotation not detected for ANY training variant in the input callset.

    In my case the problem was the training dataset: ftp.broadinstitute.org/bundle/2.8/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz

    Your error messages led me here (problem also mentioned here: http://seqanswers.com/forums/archive/index.php/t-18563.html and here: http://seqanswers.com/forums/showthread.php?t=18309), so I will post my solution here.

    EDITED:

    The problem was that I was using the wrong version of the Mills vcf: the hg19 version (which has the 'chr' prefix) instead of the b37 version that matches my callset. Hence there was no overlap between the Mills training data and my called INDELs. The solution is simply to use the version of the Mills vcf that matches your reference.
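
    A quick, hedged way to catch this kind of mismatch before running VariantRecalibrator is to compare the contig names in the callset and in the resource (the callset file name is hypothetical):

    grep -v '^#' my_indel_calls.vcf | cut -f1 | sort -u | head
    zcat Mills_and_1000G_gold_standard.indels.hg19.vcf.gz | grep -v '^#' | cut -f1 | sort -u | head
    # 'chr1' in one list and '1' in the other means the reference versions don't match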

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    @tommycarstensen It sounds like you were using the wrong reference version of the Mills vcf, hg19 instead of b37. What you're describing is the wrong way to do a liftover, unless I'm missing something (?).

    Geraldine Van der Auwera, PhD

  • tommycarstensentommycarstensen Posts: 49Member

    @Geraldine_VdAuwera said: tommycarstensen It sounds like you were using the wrong reference version of the Mills vcf, hg19 instead of b37. What you're describing is the wrong way to do a liftover, unless I'm missing something (?).

    My bad. Of course it is an incorrect way to do a liftover. I should of course have downloaded b37 from ftp.broadinstitute.org/bundle/2.8/b37. I can't edit the post now. But it still holds true that other users could have received the error we all have in common because they were using the wrong vcf from the GATK bundle. Feel free to delete my post if possible, so others don't do an incorrect liftover.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    other users could have received the error we all have in common because of using a wrong vcf in the GATK bundle

    That is a very good point and definitely worth having on the record, so I'll leave your post up, but with the solution edited.

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 36Member

    So I am admittedly still using UG with version 2.8-1, but now that --percentBad is no longer an argument, is there anything that can be done about the 'NaN LOD value assigned' error? I am trying to run VQSR on 39 high-coverage exome samples; the SNP recalibration works just fine, but the indel run gives me the error. Is there any new recommendation for this error?

    Otherwise I guess my next option would be to apply the hard filters for indels?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    @bwubb No, nothing new on that front I'm afraid. We're working on a new implementation of VQSR that we think won't suffer from the same limitations. The downside is that we're not putting any more effort into improving the old implementation, so in the meantime you're stuck with hard-filtering indels, indeed.

    Geraldine Van der Auwera, PhD

  • yeamanyeaman Posts: 12Member

    Hi there,

    I'm wondering if you have any recommendations for our study. We are working with a large exon capture in conifers and have lots of individuals (>500) but no known SNPs. We created our own SNP database, which we used for BQSR, by filtering with "VariantFiltration" using these cutoffs:

    "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0"

    This gave us a database with ~100,000,000 putative variants after filtering out those that didn't pass. Is it appropriate to use this for VQSR? Or should we try to filter it more stringently? (e.g. filter out low-frequency variants?) Or should we just do hard filtering?
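
    For anyone wanting to reproduce that kind of filtering, a minimal sketch of the corresponding VariantFiltration command (reference and file names hypothetical; the expression is copied from above):

    java -jar GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R conifer_reference.fasta \
        -V raw_snps.vcf \
        --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0" \
        --filterName "hard_filter" \
        -o filtered_snps.vcf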

    Also, our genome is currently highly fragmented, so we joined contigs together into 1000 pseudo-scaffolds and have been running analysis in parallel on these chunks to speed things up. Is the smaller size of these pseudo-scaffolds likely to compromise how well BQSR or VQSR work, given that we are running each one independently? (some scaffolds have only a few thousand SNPs, others have ~100,000).

    Thanks for your help maintaining this excellent forum. It's very helpful!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Hi @yeaman‌,

    VQSR requires that you have a pretty decent idea of the rate of false positives in your reference callset, so it would take a bit more validation work before you could confidently use your callset for VQSR, I think. I would recommend sticking with hard-filtering for now, although this sounds like a project that could eventually lead to producing a good known-sites resource.

    I'm not sure what impact the fragmentation might have, but I would expect a lot of false positives around the joining points within the pseudo-scaffolds. Also, running in parallel instead of globally will definitely compromise statistical power, because the modeling works much better when the program sees all the data. You should run BQSR per lane of data, and VQSR over all the data if you can handle the compute requirements.

    Geraldine Van der Auwera, PhD

  • yeamanyeaman Posts: 12Member

    Thanks very much for the recommendations! (and congratulations on your 5003rd post! That's a lot of helping!).

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Oh my, 5003 already? I guess I'm lucky to have found a job where spending most if not all my time on internet forums is actually viewed as a good thing :)

    Geraldine Van der Auwera, PhD
