It looks like you're new here. If you want to get involved, click one of these buttons!
The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R path/to/reference/human_g1k_v37.fasta \ -input raw.input.vcf \ -recalFile path/to/output.recal \ -tranchesFile path/to/output.tranches \ -nt 4 \ [SPECIFY TRUTH AND TRAINING SETS] \ [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \ [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \
For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle. Arguments for VariantRecalibrator command:
-percentBad 0.01 -minNumBad 1000 \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \ -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \ -an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP \ -mode SNP \
Note that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.
Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.
Some of these annotations might not be the best for your particular dataset. For example, InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included.
Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with hybrid capture datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.
Additionally, the UnifiedGenotyper produces a statistic called the HaplotypeScore which should be used for SNPs. This statistic isn't necessary for the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes.
In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:
When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle. Arguments for VariantRecalibrator:
--maxGaussians 4 -percentBad 0.01 -minNumBad 1000 \ -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \ -an DP -an FS -an ReadPosRankSum -an MQRankSum \ -mode INDEL \
Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.
InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included.
The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.
java -Xmx3g -jar GenomeAnalysisTK.jar \ -T ApplyRecalibration \ -R reference/human_g1k_v37.fasta \ -input raw.input.vcf \ -tranchesFile path/to/input.tranches \ -recalFile path/to/input.recal \ -o path/to/output.recalibrated.filtered.vcf \ [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \ [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. The default recommendation is to achieve 99.9% sensitivity to the accessible truth sites. Naturally projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity.
--ts_filter_level 99.9 \ -mode SNP \
For indels we use the Mills / 1000 Genomes indel truth set described above. The default recommendation is to achieve 99.9% sensitivity to the accessible truth sites. Naturally projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity.
--ts_filter_level 99.9 \ -mode INDEL \
Comments
I have a very small dataset (12 samples, only ~80K variants) and I found that even using the Mills gold standard indel resource, I was still unable to run VariantRecalibrator on indels. Is there any solution to this?
My call:
Some of the output:
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Eric,
Have you looked at the VQSR doc in Methods & Workflows? At the end there is a small collection of FAQs, one question being that of small datasets:
Let me know if that little tweak still doesn't help.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Where can I find the raw vcf for 1000G exomes to be used in combination with my small number of exomes for variantrecalibrator? If the raw vcfs are not available for 1000G exomes and I should call the variants from the BAM files myself, can I use the 1000G exomes' vcf called by Unifiedgenotyper with our own samples' vcfs called by haplotypecaller for variantrecalibrator? Thanks.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •The recommendations say "Add additional samples for variant calling" not for recalibration. In other words you need to do the calling simultaneously with your own samples plus the 1000 Genomes bams. You cannot just merge together VCFs.
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks for your answer. But it might not be practical to call variants for extra 20 1000G samples (to have the 30 in total) each time specially with haplotypecaller. It will take forever.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •We are actively working to speed up the run time of the Haplotype Caller. In the meantime, if run time is a problem for you, then you should use the Unified Genotyper.
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •In fact the HaplotypeCaller's runtime has been significantly improved already in the new release -- check out this overview:
http://www.broadinstitute.org/gatk/guide/version-history
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks for your note. I just downloaded the 2.2 release and I am running the haplotypecaller now and I can attest that it is at least 2 fold faster on my system. But what amazed me the most is the multi threading in baserecalibrator. I ran it with 8 cores and it is amazingly fast. Thumbs up for the team who worked on these! Do you know if multi threading could be used in this release for indelrealigner as well?
About using Unified Genotyper, I should say that I can not go back to it after using haplotypecaller and seeing the large indels it picks and UG misses them, specially after confirming them with Sanger. I just feel guilty if I want to use UG again!
- Spam
- Abuse
- Troll
1 • Off Topic Disagree Agree 1Like WTF •The team is very pleased to hear that you like the improvements :)
Unfortunately you can't use multi-threading with the IndelRealigner at this time, but you can parallelize it with scatter-gather (Queue). We'll have a new document out very soon detailing the parallelism options available for all the main tools.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Akbari, great to hear you are enjoying GATK 2.2! Any chance you could share some screenshots of those indels with us?
-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I'm trying to use VariantRecalibrator on an exome dataset. I reduced the bams and created the vcf files using UnifiedGenotyper. The SNP calling works great but the indel calling gives an error. I'm using the Mills_and_1000G_gold_standard.indels.b37.vcf that was recomended.
But it gives me the following error " ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations."
Although i checked and the annotationts are in my files i used VariantAnnotator to redo them. But i'm still getting the same error running VariantRecalibrator on indels(SNP mode works fine).
Has anyone else had problems using Mills_and_1000G_gold_standard.indels.b37.vcf for indel calling? thanks
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there, can you tell me what version you are using and what is your command line?
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Afcourse im using the latest version GenomeAnalysisTK-2.2-8
i've added my code for indel and SNPs below.
Indel calling:
SNP calling:
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Well, the most likely explanation is that there is no overlap between your training dataset and your callset. This can happen if your exome dataset is quite small. You may need to use hard filtering instead of VQSR; see our Best Practices recommendations for details.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, concerning the recommendation to include > 30 bam files in an exome capture experiment: does the selection of those extra samples (from the 1000 Genomes Project) require special attention when analysing a single sample, e.g., in a diagnostic setting? In particular, is there any danger of (1) introducing a bias in the variant calling by including a set of bam files that is biased towards a particular population ; or (2) suppressing de novo mutations and other events that occur in the sample of interest but not in the extended (the bigger the better?) set?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •To be honest, we're not sure because we haven't had the opportunity to test this extensively on different datasets. That recommendation stems from an experiment we did over a year ago, in which we found that it was better to add 30 samples from 1000 Genomes to the mix than to call one exome alone. The extra samples were not selected with any particular criteria. But we would have to do more experiments before we can say anything more concrete. In the meantime I would recommend that you try it that way (adding samples without special attention to the selection) and then also with hard filters, and then decide for yourself which callset looks less biased. Please feel free to share your observations on the forum, as I'm sure others in the community would be interested to benefit from your experience.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine, Thanks for your answer, good to know this issue is unresolved. We will see if we can learn something from experiments with in house and 1000 Genomes mixes and share whatever we find. In addition, it would be very helpful for setting up the experiments if GATK-developers, based on their understanding of the algorithms, could share their ideas about the steps where the extra samples could make a difference and how they might impact the outcome. --luc
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Luc,
What it comes down to is VQSR's machine learning model. The principles are explained in the documentation, but I agree it's difficult to grasp the deeper implications. It's also something that's difficult to explain in written texts. For that reason, we are planning to hold live Q&A sessions in which users like yourself will be able to ask questions and have the developers answer them in real-time. We're hoping this might help answer these kinds of questions.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I'm trying to call variants for exomes which are from pooled samples. We have 40 pools which each is from a pool of exomes from 10 individuals. I used UG with -ploidy 20 parameter to call each pool and run hard-filtering. I'm seeking advice on whether it is better to merge all recalibrated BAMs from all pools and use UG to call SNPs using "-ploidy 800". Then I can use VQSR for filtering, right? Any suggestions?
Thanks a lot! Jerry
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Don't call with -ploidy 800! Most probably the calls will take forever and your jobs will die from lack of memory. You should definitely call all BAMs together but do multi-sample calling and keeping the -ploidy 20 parameter which is the number of chromosomes per pool. You will then get a single vcf with all your samples but with separate genotypes for each pool (in the pooled case, the "genotype" will be the most likely # of alt alleles in each pool). I ASSUME you can use VQSR but honestly we've never tried it with pools of data so we have no idea what kind of results it will produce, nor what the best practices are in this case.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •On: http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr#latest You recommend the use of the HRun annotation:
But HRun is not included in any of the recommendations here. Is one resource out of date? Or are there specific datasets where HRun is appropriate and others where it's not.
Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Yes, the doc is a little out of date; we no longer use HRun.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I`m finding with the INDELS the same error that IK5 found. With SNPs I have not many problems. My vcf file has been generated using UnifiedGenotyper from 42 BAM files, from a target resequencing experiment of a single gene. What are the ranges values for maxGaussian, percendBAd and std that could I try to set? Thank you very much.
Here are the command:
java -Xmx24g -jar ${GATK} \ -T VariantRecalibrator \ -R $REF \ -input all_bis_mark_dup.indel.rc.bam.raw.vcf \ -resource:mills,known=true,training=true,truth=true,prior=12.0 $ref_dir/Mills_and_1000G_gold_standard.indels.b37.vcf \ -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff \ --maxGaussians 4 -percentBad 0.05 -std 10 \ -mode INDEL \ -L 13 \ -recalFile $final_dir/output_INDEL.recal \ -tranchesFile $final_dir/output_INDEL.tranches \ -rscriptFile $final_dir/output.plots_INDEL.R
Here are the error message:
INFO 14:30:24,217 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:30:24,220 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.2-16-g9f648cb, Compiled 2012/12/04 03:46:58 INFO 14:30:24,220 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 14:30:24,220 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 14:30:24,226 HelpFormatter - Program Args: -T VariantRecalibrator -R /users04/ipiras/ref/Homo_sapiens.GRCh37.62.fa -input all_bis_mark_dup.indel.rc.bam.raw.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 /ref/Mills_and_1000G_gold_standard.indels.b37.vcf -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff --maxGaussians 1 -percentBad 0.000001 -std 10 -mode INDEL -L 13 -recalFile /final_output/output_INDEL.recal -tranchesFile /final_output/output_INDEL.tranches -rscriptFile /final_output/output.plots_INDEL.R INFO 14:30:24,227 HelpFormatter - Date/Time: 2013/02/26 14:30:24 INFO 14:30:24,227 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:30:24,227 HelpFormatter - --------------------------------------------------------------------------------- INFO 14:30:24,250 ArgumentTypeDescriptor - Dynamically determined type of all_bis_mark_dup.indel.rc.bam.raw.vcf to be VCF INFO 14:30:24,269 ArgumentTypeDescriptor - Dynamically determined type of /users04/ipiras/ref/Mills_and_1000G_gold_standard.indels.b37.vcf to be VCF INFO 14:30:24,284 GenomeAnalysisEngine - Strictness is SILENT INFO 14:30:24,453 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE Target Coverage: 1000
INFO 14:30:24,585 RMDTrackBuilder - Loading Tribble index from disk for file all_bis_mark_dup.indel.rc.bam.raw.vcf INFO 14:30:24,689 RMDTrackBuilder - Loading Tribble index from disk for file /users04/ipiras/ref/Mills_and_1000G_gold_standard.indels.b37.vcf WARN 14:30:24,793 VCFStandardHeaderLines$Standards - Repairing standard header line for field GQ because -- type disagree; header has Float but standard is Integer INFO 14:30:24,808 GenomeAnalysisEngine - Processing 115169878 bp from intervals INFO 14:30:24,824 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 14:30:24,824 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining WARN 14:30:24,888 Utils - ******************************************************************************** WARN 14:30:24,889 Utils - * WARNING: WARN 14:30:24,889 Utils - * WARN 14:30:24,889 Utils - * Rscript not found in environment path. WARN 14:30:24,889 Utils - * /huentelman/057_KIBRA_RD_Nextera_6_24_11/FASTQs/Project_KIBRA_RD_Nexte WARN 14:30:24,889 Utils - * a/final_output/output.plots_INDEL.R will be generated but PDF plots WARN 14:30:24,890 Utils - * will not. WARN 14:30:24,890 Utils - ******************************************************************************** INFO 14:30:24,894 TrainingSet - Found mills track: Known = true Training = true Truth = true Prior = Q12.0 INFO 14:30:28,472 VariantDataManager - QD: mean = NaN standard deviation = NaN INFO 14:30:29,705 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.2-16-g9f648cb):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator
ERROR ------------------------------------------------------------------------------------------
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •The error you're reporting here has nothing to do with value ranges; it seems you're missing annotations. See the error message for more information.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Given that in 2.4 the Haplotype Score annotation no longer applies to indels. Could we get an updated recommendation for the INDEL command line please? Thank you
Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •For now, just remove the HS annotation from the command line; the rest still applies in the same way.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi I'm analyzing a single sample, high coverage whole genome seq data (Illumina at ~ x60). Do you have any recommendations for variant re-calibration here? Is it appropriate for this scenario? I did variant calling using this single sample with the unified genotyper (Haplotype caller is still running ...)
thanks Barak
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I could be missing something, but the documentation above refers to
hapmap_3.3.b37.sites.vcfand1000G_omni2.5.b37.sites.vcfand suggests:I think the files in the GATK resource bundle
hapmap_3.3.b37.vcf.gzand1000G_omni2.5.b37.vcf.gzare sites-only files, but not named as such. If so, the documentation would be clearer if it used the same file names as the resource bundle.- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •That's a fair point, we'll try to homogenize the names between the docs and the bundle files. Thanks for pointing this out.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi all, this page was updated in April 2013 to reflect the results of many experiments run at the Broad to provide the best possible recommendations to the community for how best to use the VQSR in conjunction with the current version of the UnifiedGenotyper and HaplotypeCaller.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, the file mentioned above '1000G_phase1.snps.high_confidence.vcf' was not found at ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.3/b37. Where can I found it?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I am also unable to locate the 1000G phase1 high_confidence file
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi folks,
The file is not available yet, sorry. This is due to a slight timing issue between documentation updates and the release process -- the new file will be in the new version of the bundle which will be released with version 2.5 within a few days.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine,
I also get NaN LOD: `
ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)`
But I'm testing on 1 sample which is WGS at 4x, is it likely that the set just isn't big enough?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi @redzengenoist,
Just one sample is a little short, I'm afraid. You should consider supplementing it with some other callset as described in the Best Practices recommendations; we typically use calls from the 1000 Genomes Project.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hello,
I am working with the newly suggested 1000G_phase1.snps.high_confidence.vcf variants to train the error model for my whole exome sequencing experiment. However, when I attempt to run VariantRecalibrator I get the following error:
INFO 09:43:24,634 HelpFormatter - Date/Time: 2013/05/09 09:43:24 INFO 09:43:24,634 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:43:24,634 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:43:24,647 ArgumentTypeDescriptor - Dynamically determined type of snpsUG_NTDplusLGMD_allsamples.raw.vcf to be VCF INFO 09:43:24,650 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/hapmap_3.3.b37.sites.vcf to be VCF INFO 09:43:24,651 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/1000G_omni2.5.b37.sites.vcf to be VCF INFO 09:43:24,652 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf to be VCF INFO 09:43:24,654 ArgumentTypeDescriptor - Dynamically determined type of /chg/mitosis1/ngs_references/dbsnp_137.b37.vcf to be VCF INFO 09:43:25,355 GenomeAnalysisEngine - Strictness is SILENT INFO 09:43:25,463 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 09:43:25,490 RMDTrackBuilder - Loading Tribble index from disk for file snpsUG_NTDplusLGMD_allsamples.raw.vcf INFO 09:43:25,667 RMDTrackBuilder - Loading Tribble index from disk for file /chg/mitosis1/ngs_references/hapmap_3.3.b37.sites.vcf INFO 09:43:25,685 RMDTrackBuilder - Loading Tribble index from disk for file /chg/mitosis1/ngs_references/1000G_omni2.5.b37.sites.vcf INFO 09:43:25,724 RMDTrackBuilder - Creating Tribble index in memory for file /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf INFO 09:47:23,671 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Input files /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf and reference have incompatible contigs: No overlapping contigs found.
ERROR /chg/mitosis1/ngs_references/1000G_phase1.snps.high_confidence.hg19.vcf contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX]
ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]
ERROR ------------------------------------------------------------------------------------------
So, is there incosistency between the high_confidence variant list and the reference? I am using human_g1k_v37.fasta
Thanks, any help would be greatly appreciated.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi @bandersoCHG,
You are using the wrong reference -- you must use the same reference as your reads were aligned against. Based on the contig names listed in the error message (and the file name of the variants) you should be using the hg19 reference. You can get it from our resource bundle if you don't have it already.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, It seems that the documentation has been changed dramatically (e.g. exome specific recommendations disappeared and it apparently requires separate VQSR applications to SNPs and INDELs, respectively, even for HaplotypeCaller [how can we handle other types of variants detected by HC?]).
Any help to interpret these changes would be greatly appreciated. Thanks
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi @kyasuno,
The documentation was updated after we re-evaluated the results we were getting with the latest versions of the tools. The changes are too sweeping for us to be able to provide a blow-by-blow explanation of why these changes were made. Ultimately these should simply be taken as our best recommendations for using the tools in their current state. We'd be happy to answer any specific questions you may have about specific points of course.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine,
Sorry for my vague comments.
I'm most interested in how MNPs and block substitutions will be handled if we need to apply VQSR to SNPs and INDELs separately for HaplotypeCaller's call set. In the previous version of recommendation, all the variant types were supposed to be analyzed together, so I assumed that complex variants like MNPs and block substitusions were analyzed properly.
Secondly, it would be great if I can learn some rationale behind the condition "-minNumBad 1000" being added. Does it tell us that training for bad variants using 1000 of worst scoring sites would be sufficient and default value of 2500 is not sensible (or is redundant)? When a specified value of percentBadVariants results in, say, 1500 sites (> 1000 of minNumBad), does VQSR use 1500 sites for training?
Thanks
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
1.) The names of the modes are a little ambiguous. The SNP mode handles SNP variants while the INDEL mode does everything else-- including in the model the MNP and complex variants. In our tests we achieve better results by breaking out the SNPs from the rest of the variants and modeling them separately.
2.) The number of negative training examples is the maximum between the flat number specified by -minNumBad and the number calculated with -percentBadVariants. The value was lowered in the updated best practice recommendations because our testing revealed that there are much fewer false positive raw calls being made in the newer pipeline.
Cheers,
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •