We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
question about odd result for VQSR

Hi,
Recently I run into some odd observation in VQSR. I have 17 samples from a same family and I used all of 17 samples to call SNPs and after VQSR, I got the trench file like this:
Variant quality score tranches file
Version number 5
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,48637,716,2.9527,2.3302,4.8390,VQSRTrancheSNP0.00to90.00,SNP,26182,23563,0.9000
99.00,60114,1531,2.8057,2.3333,1.7766,VQSRTrancheSNP90.00to99.00,SNP,26182,25920,0.9900
99.90,67220,2884,2.7190,1.8222,-10.0009,VQSRTrancheSNP99.00to99.90,SNP,26182,26155,0.9990
100.00,69714,4998,2.6822,1.8300,-1122.0698,VQSRTrancheSNP99.90to100.00,SNP,26182,26182,1.0000
which seems fine. then for research purpose, I only used 5 samples of more tight relation such as two parents and their 3 immediate children and after VQSR, the trench file looks like below:
Variant quality score tranches file
Version number 5
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP0.00to90.00,SNP,20850,20850,1.0000
99.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP90.00to99.00,SNP,20850,20850,1.0000
99.90,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.00to99.90,SNP,20850,20850,1.0000
100.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.90to100.00,SNP,20850,20850,1.0000
Notice that the 5-sample VQSR tranch file has exactly the same thing throughout all thresholds: 90, 99, 99.90 and 100. and the VQSR modeling plot is also very odd, no plotting at all being seen (the pdf ifle was created but was almost blank in contrast to the normal projection plots I saw in other cases)
However, we did use the old version to call the same 5 samples before, and the trench file looks like below:
Variant quality score tranches file
Version number 4
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,36407,361,2.8657,2.3119,5.0854,TruthSensitivityTranche0.00to90.00,20814,18732,0.9000
99.00,44097,638,2.7655,2.2222,2.2592,TruthSensitivityTranche90.00to99.00,20814,20605,0.9900
99.90,47947,1061,2.7078,1.8750,-7.4143,TruthSensitivityTranche99.00to99.90,20814,20793,0.9990
100.00,50426,2318,2.6645,1.7677,-647.3944,TruthSensitivityTranche99.90to100.00,20814,20814,1.0000
this time, it looks reasonable to me. This is troubling us since for 5 samples, the old version (V1.6-7) seems working fine, whereas the new version (V2.1-13) seems having issue or can not get further filtering by VQSR (90, 99 and 100 got the same result, I did repeat multiple times and got the same results), although for all of the 17 samples, the new version seems fine on VQSR.
So my questions are:
1. is it possible that in some occasion, VQSR can simply not work?
2. Why the old version seems working but not the new version for exactly the same set of 5-sample data?
Thanks a lot for your help!
Mike
Answers
Hi,
After carefully checking, I found more info although still puzzled:
I compared the old and new version commands side by side and realized a major difference in Unified genotyper command, in which the old version use -dcov 50 as default option suggested at the time (I do have the old documnetation for old version saved as in pdf file) and now in new version (V2) it changed to -dcov 200 (-dcov [50 for 4x, 200 for >30x WGS or Whole exome], see URL:http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html) and so for V2, I did use -dcov 200, which caused the odd result as mentioned above, More surprisingly, when I re-run the v2 GATK using UG with -dcov 50 like in old version for the same dataset (just 5 samples), I got similar result as the old version result. the tranch file looks like below, which seems much reasonable in terms of TiTv ratios improvement from 99.9 to 99 to 90 etc:
Variant quality score tranches file
Version number 5
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,35484,316,2.9264,2.1600,4.6455,VQSRTrancheSNP0.00to90.00,SNP,20849,18764,0.9000
99.00,44135,624,2.7757,2.1515,1.9238,VQSRTrancheSNP90.00to99.00,SNP,20849,20640,0.9900
99.90,47937,1007,2.7121,1.8523,-7.7956,VQSRTrancheSNP99.00to99.90,SNP,20849,20828,0.9990
100.00,50469,2133,2.6676,1.7890,-777.6778,VQSRTrancheSNP99.90to100.00,SNP,20849,20849,1.000
Also the VQSR modeling plot looks also normal as I usually saw, whereas the one using UG -dcov 200 give a very odd plot (only one page and also almost no plot at all for the whole pdf file for the model plot)
So my question is: why the option of dcov in UG step (although I used what was suggested by both old and new version documentations) has such big impact on the VQSR later disregard of whether it is old or new version I am running?
Thanks a lot for any insights!
Mike
Hi there,
There seems to be a lot of details here. Can you please post the full command lines that you are using to help us understand what you are seeing.
Cheers,
Hi, Ryan:
Thanks for looking into this.
I just posted the UG steps that I found may be the major causes, let me know if you need other info:
java -jar /opt/nasapps/stow/GenomeAnalysisTK-2.2-4-g4a174fb/bin/GenomeAnalysisTK.jar -T UnifiedGenotyper
-R /path/hg19.fa
-I /path/S1.bam -I /path/S2.bam -I /path/S3.bam -I /path/S4.bam -I /path/S5.bam
--dbsnp /Path/dbsnp_135.hg19.vcf
-glm BOTH
-o /Path/GATK_UG_5Samples_snps_indel.raw.afterRecal_2012_V2.vcf
-stand_call_conf 50 -stand_emit_conf 10 -dcov 200
-L /Path/029720_D_BED_20101013.bed
After this, I used SelectVariants to get SNPs only vcf file, which was subjected to VQSR step, then I got the odd VQSR tranches file as shown above
However, for the same set of 5 samples, for UG step like above command, I just use -dcov 50 instead of -dcov 200, then I got the reasonable tranches file shown above (in my 2nd post).
Is it odd? Did I do something wrong here?
Thanks
Mike
Hi there,
I see that you are using an intervals list:
-L /Path/029720_D_BED_20101013.bed
Is this by any chance a targeted exome or smaller project? We don't recommend using the VQSR on such projects with small numbers of samples. My guess is that the fluctuations you are seeing don't have anything to do with the downsampling in the UG but are rather a sign of numerical instability in the VQSR with small numbers of variants. Take a look at our best practice recommendations for advice on how you might be able to use the VQSR with your project.
I hope that helps,
Hi, Ryan:
Thanks for the input!
yes, it is a small project but as whole exome seq (the bed file is the whole exome target regions from agilent) with just a family of individuals with some of them with a type of cancer suspected to be inheritance-related or genetic cause. I double check the documentation of VQSR, which emphasize on the number of variant sites in the callsets at range of thousands of variants (except for using InbreedingCoeff needs at least 10 samples), and my dataset does have about 50k variants after UG for only 5 samples ( a subset of the family with Trio relations) and 70 k for all members of the big family (17), so I though it is reasonable to use VQSR, does it? I did on just 5 of them is because these 5 samples are with trio relation, we just want to see any improvement of the calls. I did do the variant calls and VQSR on all 17 samples of the whole family as well. which looks fine. So even if I have 50k variants in the 5-sample callset, the number of samples as 5 still not good enough?
I just wonder why even for 5 sample callset, after using -docv 50 on UG step, and it seems now that VQSR has the separation for 0.90 and 0.99 thresholds as mentioned above, compared to if use -dcov 200?
Any insight?
Thanks again
Mike
Hi, Ryan:
I did try VQSR with --maxGaussians 4 (used to be --maxGaussians 6), now I got tranches file as below, which looks fine with me now:
did not realize that --maxGaussians has such great impact on this. What do you think?
Thanks and best
Mike
I have a question about the VQSR's tranches output. I've been scouring the GATK pages and I noticed that my tranches plot does not quite behave the way it is described in the howto:
My plots appear to be behaving in the opposite manner, or maybe I'm just misunderstanding what is meant by "the lowest tranche"?
My tranches file:
# Version number 5
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,0,7954491,0.0000,1.9963,10.6484,VQSRTrancheSNP0.00to90.00,SNP,3856587,3470928,0.9000
99.00,0,9848620,0.0000,2.0676,8.4217,VQSRTrancheSNP90.00to99.00,SNP,3856587,3818021,0.9900
99.90,0,11022663,0.0000,2.1288,6.0801,VQSRTrancheSNP99.00to99.90,SNP,3856587,3852730,0.9990
100.00,0,17485011,0.0000,2.4710,-687620.4316,VQSRTrancheSNP99.90to100.00,SNP,3856587,3856587,1.0000
I'm working in a non-human organism, so I bootstrapped by very stringent filtering of an initial call set (3.8M input training SNPs - 18 samples = WGS) with
GATK 2.5-2-gf57256b
and fed it into VQSR with the raw VCF.--reference_sequence $CASSAVA/genome/scaffolds+cpDNA.fasta \
--excludeIntervals $CASSAVA/genome/cassavaV4.1685010_168501.repeat.masked.srt.bed \
--mode SNP --dirichlet 0.01 \
--resource:bootstrap,known=false,training=true,truth=true,prior=8.0 bootstrap.red.HWE.flt.g60.recode.vcf \
--input Manihot_UG.rdx.vcf --recal_file Manihot_UG.rdx.DP.recal --tranches_file Manihot_UG.rdx.DP.tranches \
--rscript_file Manihot_UG.rdx.DP.Rplot \
-an BaseQRankSum -an HaplotypeScore -an QD -an DP -an ReadPosRankSum
Sorry about that being posted so many times. I think that was a Safari bug...
Anyway, my question is: Although my number of variants increases at higher sensitivity, does it seem unusual that my FP variant fraction decreases at higher sensitivity as well? And at the 100% truth sensitivity level, no FPs are reported? Am I mis-interpreting these plots? Is this a bug in calculating the tranches plot? As far as I'm able to infer from the example plots posted on the tutorial pages, the Cumulative TPs are calculated by simply by summing the TPs from the previous tranche, yet I have Cumulative TPs reported for my lowest tranche and none for my highest-sensitivity tranche...
Any help (correctly) interpreting these plots would be much appreciated!!!
I'm leaving a couple of tranches plots at the public FTP (a v2.5-2 calculated plot and a v2.6-3 calculated plot):
Manihot_UG.rdx.tranches-v2.5.png
Manihot_UG.rdx.tranches-v2.6.png
Hi there, sorry for the delayed response. No worries about the bug, though you almost got banned by the automatic spam filter.
I can't find your plots on the ftp, are you sure they uploaded successfully? You can attach them to a forum comment too, btw, which may be simpler.
Attached is the
--rscript_file
and the tranches plots for GATK v2.5 and v2.6, respectively.Hi there,
The ti/tv values for the tranches are quite a bit different for the two versions which is surprising. Can you post the log output for the two runs of VariantRecalibrator? There is a lot of interesting info in there that will help us debug.
Thanks for your help,
I, unfortunately, no longer have the v2.6 logging output. Perhaps the difference comes from specifying my stringently-filtered (unknown) dataset as
known=true/false
in the--resources
argument? At the time I was fiddling with parameters (setting known vs not). The v2.5 plot above was produced withknown=false
, and the v2.6 plot was produced withknown=true
. When I re-ran v2.5 withknown=true
the tranches plot looked very similar to the v2.6 plot. The log from the v2.5 (withknown=true
) is provided below (excluding ProgressMeter reports for current position in genome):To clarify, my issue is not about the difference between the two plots provided above, but rather that my fractions of TPs/FPs (in both plots) are "flipped" with respect to the titv values, relative to all of the example plots I've seen in the docs -- I'm expecting to see 100% tranche-specific SNPs in the 90-tranche with decreasing fraction of tranche-specific SNPs, and increasing FPs, as tranche sensitivity is increased, but this is not what I'm seeing. I'm wondering if there is an error in the way the tranche plot was calculated or if I had run VQSR incorrectly...
Hi,
I used to be able to run VQSR step for my old data, however, when I started the new version 2.6-4, I got the following error message from my log file:
.....
.....
INFO 18:38:46,252 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.08907
INFO 18:38:46,307 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.02334
INFO 18:38:46,363 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.03583
INFO 18:38:46,407 VariantRecalibratorEngine - Convergence after 54 iterations!
INFO 18:38:46,414 VariantRecalibratorEngine - Evaluating full set of 180431 variants...
INFO 18:38:47,634 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Unable to retrieve result
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
Caused by: java.lang.IllegalArgumentException: log10p: Values must be non-infinite and non-NAN
at org.broadinstitute.sting.utils.MathUtils.log10sumLog10(MathUtils.java:236)
at org.broadinstitute.sting.utils.MathUtils.log10sumLog10(MathUtils.java:224)
at org.broadinstitute.sting.utils.MathUtils.log10sumLog10(MathUtils.java:249)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.GaussianMixtureModel.evaluateDatum(GaussianMixtureModel.java:238)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.evaluateDatum(VariantRecalibratorEngine.java:167)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.evaluateData(VariantRecalibratorEngine.java:100)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:335)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:132)
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226)
at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183)
... 5 more
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.6-4-g3e5ff60):
ERROR
ERROR Please check the documentation guide to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Unable to retrieve result
ERROR ------------------------------------
Here is my command:
java -Xmx4g -jar /path/gatk/2.6-4/bin/GenomeAnalysisTK.jar -T VariantRecalibrator -R /Path/hg19.fa -nt 4
-input /Path/GATK_UG_SelSNP.vcf
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /Path/bundle-2.5/hg19/hapmap_3.3.hg19.vcf
-resource:omni,known=false,training=true,truth=true,prior=12.0 /Pathb/bundle-2.5/hg19/1000G_omni2.5.hg19.vcf
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /Path/bundle-2.5/hg19/dbsnp_137.hg19.vcf
-resource:1000G,known=false,training=true,truth=false,prior=10.0 /Path/bundle-2.5/hg19/1000G_phase1.snps.high_confidence.hg19.vcf
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP -an InbreedingCoeff
-percentBad 0.01 -minNumBad 1000
-recalFile /Path/GATK_UG_SelSNP.VarRecal
-tranchesFile /Path/GATK_UG_SelSNP.tranches
-rscriptFile /Path/GATK_UG_SelSNP.R
-mode SNP
What is even odd is: when I run similar command for indels, it went through without any error, here is the indel command:
java -Xmx4g -jar /Path/gatk/2.6-4/bin/GenomeAnalysisTK.jar -T VariantRecalibrator -R /Path//hg19.fa
-input /Path/GATK_UG_SelIndel.vcf
-resource:mills,known=false,training=true,truth=true,prior=12.0 /Path/bundle-2.5/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /Path/bundle-2.5/hg19/dbsnp_137.hg19.vcf
-percentBad 0.01 -minNumBad 1000
-an DP -an FS -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff
--maxGaussians 4
-recalFile /Path/GATK_UG_SelIndel.VarRecal
-tranchesFile /Path/GATK_UG_SelIndel.tranches
-rscriptFile /Path/GATK_UG_SelIndel.R
-mode INDEL
Any idea why I got such error in SNP VQSR but not in Indel VQSR? In Indel, I used --maxGaussians 4. but in SNP, I used -nt 4 based on documnetation, other than the -resource files from bundle, these are only difference between the indel and SNP VQSR.
Thanks for the help!
Mike
I'm putting a tarball of the v2.5 and v2.6 VQSR plots, tranches, and run-log files on the FTP in the "incoming" folder called Manihot_VQSR.tar.gz
Scratch that, it's in the main public ftp root dir as Manihot_VQSR.tar.gz
-rw-r--r-- 1 depristo wga 27491078 Jul 4 01:55 Manihot_VQSR.tar.gz
A little explanation of the content naming in the tar ball:
1) files are labeled with the version of GATK used to do VQSR,
2) files are labeled with
true
orfalse
indicating the state that theknown
attribute was set to at the--resource
flag.Hi:
Just follow up on my last post, for SNP VQSR, I found out that the error was caused by the absence of --maxGaussians 4, as long as I added the option: --maxGaussians 4, with or without -nt 4, they all went through well. Not sure what default value for --maxGaussians. With -h , I can not see it. any way, any idea why it behaved like this?
Thanks for the help!
Mike
@bredeson, thanks for the files. Due to the holiday we may not be able to get to this for a couple of days though. Thanks for your patience.
HI there,
I think one source of the confusion with TP/FPs on the tranche plot is that in the 2.6 plot all of the tranches are over the specified target ti/tv value and so there is no way to estimate TP/FPs (everything is assumed to be a TP) but in the 2.5 plot you have ti/tv that go above and below the target ti/tv value. Is this data from an exome sequencing project? If so you probably want to increases that target ti/tv and I think the tranche plots will make more sense. Take a look at the --target_titv argument in VariantRecalibrator.
Let me know if that helps.
Hello rpoplin,
It's been a while, but I'm revisiting this subject, as I do not yet fully understand how my truth=90 tranche can have cumulative [TF]Ps...
I realigned around indels (which I thought I had done with the prior dataset) and then re-called (with UG) and re-ran VQSR with a
--target_titv
of2.5
and the following plots were outputted. This result, as before, still contains tranches with cumulative [TF]Ps in the starting tranche, and it is still not clear to me how that is possible: is this an artifact of my training data (stringently filtered sites from the same dataset) or of the way I ran VariantRecalibrator (see log below)?Also, the Cumulative FPs bins overlap the Tranch-Specific TPs bins; this looks like a plotting error to me, but is this intended/meaningful? Again, my concern stems from the dissimilarity of my plot with the example plots here: http://www.broadinstitute.org/gatk/guide/article?id=39
Thank you so much for your help!
INFO 12:26:54,300 ArgumentTypeDescriptor - Dynamically determined type of /projectb/projectdirs/plant/analysis/cassava/genome/cassavaV4.1685010_168501.repeat.masked.srt.bed to be BED INFO 12:26:54,384 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:26:54,384 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02 INFO 12:26:54,385 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 12:26:54,385 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 12:26:54,390 HelpFormatter - Program Args: -T VariantRecalibrator --reference_sequence /projectb/projectdirs/plant/analysis/cassava/genome/scaffolds+cpDNA.fasta --excludeIntervals /projectb/projectdirs/plant/analysis/cassava/genome/cassavaV4.1685010_168501.repeat.masked.srt.bed --mode SNP --resource:bootstrap,known=false,training=true,truth=true,prior=8.0 Manihot_UG.VF2+HW.cDP.vcf --input Manihot_UG.vcf --recal_file Manihot_UG.titv2_5.recal --tranches_file Manihot_UG.titv2_5.tranches --rscript_file Manihot_UG.titv2_5.Rplot -an BaseQRankSum -an FS -an InbreedingCoeff -an HaplotypeScore -an QD -an DP -an MQRankSum -an ReadPosRankSum --target_titv 2.5 INFO 12:26:54,391 HelpFormatter - Date/Time: 2013/08/06 12:26:54 INFO 12:26:54,391 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:26:54,391 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:26:54,456 ArgumentTypeDescriptor - Dynamically determined type of Manihot_UG.vcf to be VCF INFO 12:26:54,506 ArgumentTypeDescriptor - Dynamically determined type of Manihot_UG.VF2+HW.cDP.vcf to be VCF INFO 12:26:55,427 GenomeAnalysisEngine - Strictness is SILENT INFO 12:26:56,636 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 12:26:56,678 RMDTrackBuilder - Loading Tribble index from disk for file Manihot_UG.vcf INFO 12:26:57,559 RMDTrackBuilder - Loading Tribble index from disk for file Manihot_UG.VF2+HW.cDP.vcf INFO 12:26:58,768 IntervalUtils - Initial include intervals span 532668733 loci; exclude intervals span 7340626 loci INFO 12:26:58,772 IntervalUtils - Excluding 7340626 loci from original intervals (1.38% reduction) INFO 12:26:58,773 IntervalUtils - Processing 525328107 bp from intervals INFO 12:26:59,279 GenomeAnalysisEngine - Creating shard strategy for 0 BAM files INFO 12:26:59,317 GenomeAnalysisEngine - Done creating shard strategy INFO 12:26:59,318 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 12:26:59,318 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 12:26:59,326 TrainingSet - Found bootstrap track: Known = false Training = true Truth = true Prior = Q8.0 ... INFO 13:08:36,396 ProgressMeter - scaffold12914:16853 3.04e+07 41.6 m 82.0 s 99.7% 41.8 m 8.0 s INFO 13:08:47,619 VariantDataManager - BaseQRankSum: mean = 1.09 standard deviation = 1.86 INFO 13:08:53,561 VariantDataManager - FS: mean = 7.53 standard deviation = 9.53 INFO 13:08:59,407 VariantDataManager - InbreedingCoeff: mean = 0.17 standard deviation = 0.36 INFO 13:09:05,407 VariantDataManager - HaplotypeScore: mean = 6.81 standard deviation = 6.65 INFO 13:09:06,399 ProgressMeter - scaffold12977:1750 3.05e+07 42.1 m 82.0 s 100.0% 42.1 m 0.0 s INFO 13:09:11,277 VariantDataManager - QD: mean = 7.96 standard deviation = 5.89 INFO 13:09:17,085 VariantDataManager - DP: mean = 567.93 standard deviation = 267.48 INFO 13:09:22,875 VariantDataManager - MQRankSum: mean = 0.00 standard deviation = 2.74 INFO 13:09:28,635 VariantDataManager - ReadPosRankSum: mean = 0.73 standard deviation = 1.75 INFO 13:09:36,401 ProgressMeter - scaffold12977:1750 3.05e+07 42.6 m 83.0 s 100.0% 42.6 m 0.0 s INFO 13:09:37,230 VariantDataManager - Training with 13035063 variants after standard deviation thresholding. INFO 13:09:37,291 GaussianMixtureModel - Initializing model with 30 k-means iterations... ... INFO 18:07:42,225 VariantRecalibratorEngine - Evaluating full set of 17522718 variants... INFO 18:08:04,181 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.2 m 100.0% 5.7 h 0.0 s INFO 18:08:34,183 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.2 m 100.0% 5.7 h 0.0 s INFO 18:09:04,185 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.2 m 100.0% 5.7 h 0.0 s INFO 18:09:34,187 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.2 m 100.0% 5.7 h 0.0 s INFO 18:10:04,190 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.2 m 100.0% 5.7 h 0.0 s INFO 18:10:34,192 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.3 m 100.0% 5.7 h 0.0 s INFO 18:10:37,827 VariantDataManager - Found 0 variants overlapping bad sites training tracks. INFO 18:11:04,194 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.3 m 100.0% 5.7 h 0.0 s INFO 18:11:26,512 VariantDataManager - Additionally training with worst 3.000% of passing data --> 525682 variants with LOD <= -14.8842. INFO 18:11:26,513 GaussianMixtureModel - Initializing model with 30 k-means iterations... INFO 18:11:34,196 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.3 m 100.0% 5.7 h 0.0 s INFO 18:12:04,199 ProgressMeter - scaffold12977:1750 3.05e+07 5.8 h 11.3 m 100.0% 5.8 h 0.0 s INFO 18:12:34,201 ProgressMeter - scaffold12977:1750 3.05e+07 5.8 h 11.3 m 100.0% 5.8 h 0.0 s INFO 18:12:35,512 VariantRecalibratorEngine - Finished iteration 0. INFO 18:13:04,203 ProgressMeter - scaffold12977:1750 3.05e+07 5.8 h 11.3 m 100.0% 5.8 h 0.0 s INFO 18:13:34,205 ProgressMeter - scaffold12977:1750 3.05e+07 5.8 h 11.4 m 100.0% 5.8 h 0.0 s ... INFO 18:22:10,281 VariantRecalibratorEngine - Convergence after 36 iterations! ... INFO 19:06:35,704 VariantRecalibrator - Executing: Rscript /global/projectb/scratch/bredeson/Manihot_esculenta/WGS_HQ/UnifiedGenotyper_Realigned/Manihot_UG.titv2_5.Rplot ... INFO 19:11:17,745 VariantRecalibrator - Executing: Rscript (resource)org/broadinstitute/sting/gatk/walkers/variantrecalibration/plot_Tranches.R /global/projectb/scratch/bredeson/Manihot_esculenta/WGS_HQ/UnifiedGenotyper_Realigned/Manihot_UG.titv2_5.tranches 2.5 ... INFO 19:11:32,736 ProgressMeter - done 3.05e+07 6.7 h 13.3 m 100.0% 6.7 h 0.0 s INFO 19:11:32,736 ProgressMeter - Total runtime 24273.42 secs, 404.56 min, 6.74 hours INFO 19:12:07,556 GATKRunReport - Uploaded run statistics report to AWS S3
@rpoplin: To answer your previous question, this is WGS data.
Hi I have analyzed 2000 exome samples using GATK 3.2.2 and I have ended up with a huge around 700gb VCF file after VQSR. My tranch plot looks really bad (attached file) and the ti/tv ratio is really low. I did not use the interval list (-L) during my analysis .
Now, can I use the interval list and exclude those variants outside the intervals after the VQSR ?? or do I have to run all the steps again ?? Please let me know. Thank you in advance.
Just checked that the vcf file has got 29513599 SNPs that passed the VQSR to let you know.
@dheeraj You can re-run just the VQSR steps, this time using the -L intervals. This will eliminate the variants that are outside your target regions. Also, you can specify the expected titv argument to match your expectation for exome, which is different from whole genome.