Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

question about odd result for VQSR

mikemike Posts: 103Member

Hi,

Recently I run into some odd observation in VQSR. I have 17 samples from a same family and I used all of 17 samples to call SNPs and after VQSR, I got the trench file like this:

Variant quality score tranches file

Version number 5

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,48637,716,2.9527,2.3302,4.8390,VQSRTrancheSNP0.00to90.00,SNP,26182,23563,0.9000
99.00,60114,1531,2.8057,2.3333,1.7766,VQSRTrancheSNP90.00to99.00,SNP,26182,25920,0.9900
99.90,67220,2884,2.7190,1.8222,-10.0009,VQSRTrancheSNP99.00to99.90,SNP,26182,26155,0.9990
100.00,69714,4998,2.6822,1.8300,-1122.0698,VQSRTrancheSNP99.90to100.00,SNP,26182,26182,1.0000

which seems fine. then for research purpose, I only used 5 samples of more tight relation such as two parents and their 3 immediate children and after VQSR, the trench file looks like below:

Variant quality score tranches file

Version number 5

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP0.00to90.00,SNP,20850,20850,1.0000
99.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP90.00to99.00,SNP,20850,20850,1.0000
99.90,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.00to99.90,SNP,20850,20850,1.0000
100.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.90to100.00,SNP,20850,20850,1.0000

Notice that the 5-sample VQSR tranch file has exactly the same thing throughout all thresholds: 90, 99, 99.90 and 100. and the VQSR modeling plot is also very odd, no plotting at all being seen (the pdf ifle was created but was almost blank in contrast to the normal projection plots I saw in other cases)

However, we did use the old version to call the same 5 samples before, and the trench file looks like below:

Variant quality score tranches file

Version number 4

targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,36407,361,2.8657,2.3119,5.0854,TruthSensitivityTranche0.00to90.00,20814,18732,0.9000
99.00,44097,638,2.7655,2.2222,2.2592,TruthSensitivityTranche90.00to99.00,20814,20605,0.9900
99.90,47947,1061,2.7078,1.8750,-7.4143,TruthSensitivityTranche99.00to99.90,20814,20793,0.9990
100.00,50426,2318,2.6645,1.7677,-647.3944,TruthSensitivityTranche99.90to100.00,20814,20814,1.0000

this time, it looks reasonable to me. This is troubling us since for 5 samples, the old version (V1.6-7) seems working fine, whereas the new version (V2.1-13) seems having issue or can not get further filtering by VQSR (90, 99 and 100 got the same result, I did repeat multiple times and got the same results), although for all of the 17 samples, the new version seems fine on VQSR.

So my questions are:
1. is it possible that in some occasion, VQSR can simply not work?
2. Why the old version seems working but not the new version for exactly the same set of 5-sample data?

Thanks a lot for your help!

Mike

Tagged:

Answers

  • mikemike Posts: 103Member

    Hi,

    After carefully checking, I found more info although still puzzled:

    I compared the old and new version commands side by side and realized a major difference in Unified genotyper command, in which the old version use -dcov 50 as default option suggested at the time (I do have the old documnetation for old version saved as in pdf file) and now in new version (V2) it changed to -dcov 200 (-dcov [50 for 4x, 200 for >30x WGS or Whole exome], see URL:http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html) and so for V2, I did use -dcov 200, which caused the odd result as mentioned above, More surprisingly, when I re-run the v2 GATK using UG with -dcov 50 like in old version for the same dataset (just 5 samples), I got similar result as the old version result. the tranch file looks like below, which seems much reasonable in terms of TiTv ratios improvement from 99.9 to 99 to 90 etc:

    Variant quality score tranches file

    Version number 5

    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,35484,316,2.9264,2.1600,4.6455,VQSRTrancheSNP0.00to90.00,SNP,20849,18764,0.9000
    99.00,44135,624,2.7757,2.1515,1.9238,VQSRTrancheSNP90.00to99.00,SNP,20849,20640,0.9900
    99.90,47937,1007,2.7121,1.8523,-7.7956,VQSRTrancheSNP99.00to99.90,SNP,20849,20828,0.9990
    100.00,50469,2133,2.6676,1.7890,-777.6778,VQSRTrancheSNP99.90to100.00,SNP,20849,20849,1.000

    Also the VQSR modeling plot looks also normal as I usually saw, whereas the one using UG -dcov 200 give a very odd plot (only one page and also almost no plot at all for the whole pdf file for the model plot)

    So my question is: why the option of dcov in UG step (although I used what was suggested by both old and new version documentations) has such big impact on the VQSR later disregard of whether it is old or new version I am running?

    Thanks a lot for any insights!

    Mike

  • rpoplinrpoplin Posts: 121GATK Developer mod

    Hi there,

    There seems to be a lot of details here. Can you please post the full command lines that you are using to help us understand what you are seeing.

    Cheers,

  • mikemike Posts: 103Member

    Hi, Ryan:

    Thanks for looking into this.

    I just posted the UG steps that I found may be the major causes, let me know if you need other info:

    java -jar /opt/nasapps/stow/GenomeAnalysisTK-2.2-4-g4a174fb/bin/GenomeAnalysisTK.jar -T UnifiedGenotyper
    -R /path/hg19.fa

    -I /path/S1.bam -I /path/S2.bam -I /path/S3.bam -I /path/S4.bam -I /path/S5.bam
    --dbsnp /Path/dbsnp_135.hg19.vcf
    -glm BOTH
    -o /Path/GATK_UG_5Samples_snps_indel.raw.afterRecal_2012_V2.vcf
    -stand_call_conf 50 -stand_emit_conf 10 -dcov 200
    -L /Path/029720_D_BED_20101013.bed

    After this, I used SelectVariants to get SNPs only vcf file, which was subjected to VQSR step, then I got the odd VQSR tranches file as shown above

    However, for the same set of 5 samples, for UG step like above command, I just use -dcov 50 instead of -dcov 200, then I got the reasonable tranches file shown above (in my 2nd post).

    Is it odd? Did I do something wrong here?

    Thanks

    Mike

  • rpoplinrpoplin Posts: 121GATK Developer mod

    Hi there,

    I see that you are using an intervals list: -L /Path/029720_D_BED_20101013.bed
    Is this by any chance a targeted exome or smaller project? We don't recommend using the VQSR on such projects with small numbers of samples. My guess is that the fluctuations you are seeing don't have anything to do with the downsampling in the UG but are rather a sign of numerical instability in the VQSR with small numbers of variants. Take a look at our best practice recommendations for advice on how you might be able to use the VQSR with your project.

    I hope that helps,

  • mikemike Posts: 103Member

    Hi, Ryan:

    Thanks for the input!

    yes, it is a small project but as whole exome seq (the bed file is the whole exome target regions from agilent) with just a family of individuals with some of them with a type of cancer suspected to be inheritance-related or genetic cause. I double check the documentation of VQSR, which emphasize on the number of variant sites in the callsets at range of thousands of variants (except for using InbreedingCoeff needs at least 10 samples), and my dataset does have about 50k variants after UG for only 5 samples ( a subset of the family with Trio relations) and 70 k for all members of the big family (17), so I though it is reasonable to use VQSR, does it? I did on just 5 of them is because these 5 samples are with trio relation, we just want to see any improvement of the calls. I did do the variant calls and VQSR on all 17 samples of the whole family as well. which looks fine. So even if I have 50k variants in the 5-sample callset, the number of samples as 5 still not good enough?

    I just wonder why even for 5 sample callset, after using -docv 50 on UG step, and it seems now that VQSR has the separation for 0.90 and 0.99 thresholds as mentioned above, compared to if use -dcov 200?

    Any insight?

    Thanks again

    Mike

  • mikemike Posts: 103Member
    edited December 2012

    Hi, Ryan:

    I did try VQSR with --maxGaussians 4 (used to be --maxGaussians 6), now I got tranches file as below, which looks fine with me now:

    # Variant quality score tranches file
    # Version number 5
    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,36765,342,2.8754,2.1963,4.9773,VQSRTrancheSNP0.00to90.00,SNP,20850,18765,0.9000
    99.00,43801,595,2.7815,2.2514,2.5320,VQSRTrancheSNP90.00to99.00,SNP,20850,20641,0.9900
    99.90,47986,1045,2.7042,1.7909,-9.6192,VQSRTrancheSNP99.00to99.90,SNP,20850,20829,0.9990
    100.00,50598,2279,2.6625,1.7993,-1049.0069,VQSRTrancheSNP99.90to100.00,SNP,20850,20850,1.0000

    did not realize that --maxGaussians has such great impact on this. What do you think?

    Thanks and best

    Mike

    Post edited by Geraldine_VdAuwera on
  • bredesonbredeson Posts: 20Member

    I have a question about the VQSR's tranches output. I've been scouring the GATK pages and I noticed that my tranches plot does not quite behave the way it is described in the howto:

    The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls.

    My plots appear to be behaving in the opposite manner, or maybe I'm just misunderstanding what is meant by "the lowest tranche"?

    My tranches file:

    # Variant quality score tranches file
    # Version number 5
    targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
    90.00,0,7954491,0.0000,1.9963,10.6484,VQSRTrancheSNP0.00to90.00,SNP,3856587,3470928,0.9000
    99.00,0,9848620,0.0000,2.0676,8.4217,VQSRTrancheSNP90.00to99.00,SNP,3856587,3818021,0.9900
    99.90,0,11022663,0.0000,2.1288,6.0801,VQSRTrancheSNP99.00to99.90,SNP,3856587,3852730,0.9990
    100.00,0,17485011,0.0000,2.4710,-687620.4316,VQSRTrancheSNP99.90to100.00,SNP,3856587,3856587,1.0000

    I'm working in a non-human organism, so I bootstrapped by very stringent filtering of an initial call set (3.8M input training SNPs - 18 samples = WGS) with GATK 2.5-2-gf57256b and fed it into VQSR with the raw VCF.

    GenomeAnalysisTK VariantRecalibrator \

    --reference_sequence $CASSAVA/genome/scaffolds+cpDNA.fasta \
    --excludeIntervals $CASSAVA/genome/cassavaV4.1685010_168501.repeat.masked.srt.bed \
    --mode SNP --dirichlet 0.01 \
    --resource:bootstrap,known=false,training=true,truth=true,prior=8.0 bootstrap.red.HWE.flt.g60.recode.vcf \
    --input Manihot_UG.rdx.vcf --recal_file Manihot_UG.rdx.DP.recal --tranches_file Manihot_UG.rdx.DP.tranches \
    --rscript_file Manihot_UG.rdx.DP.Rplot \
    -an BaseQRankSum -an HaplotypeScore -an QD -an DP -an ReadPosRankSum

  • bredesonbredeson Posts: 20Member

    Sorry about that being posted so many times. I think that was a Safari bug...

    Anyway, my question is: Although my number of variants increases at higher sensitivity, does it seem unusual that my FP variant fraction decreases at higher sensitivity as well? And at the 100% truth sensitivity level, no FPs are reported? Am I mis-interpreting these plots? Is this a bug in calculating the tranches plot? As far as I'm able to infer from the example plots posted on the tutorial pages, the Cumulative TPs are calculated by simply by summing the TPs from the previous tranche, yet I have Cumulative TPs reported for my lowest tranche and none for my highest-sensitivity tranche...

    Any help (correctly) interpreting these plots would be much appreciated!!!

    I'm leaving a couple of tranches plots at the public FTP (a v2.5-2 calculated plot and a v2.6-3 calculated plot):

    Manihot_UG.rdx.tranches-v2.5.png
    Manihot_UG.rdx.tranches-v2.6.png

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin

    Hi there, sorry for the delayed response. No worries about the bug, though you almost got banned by the automatic spam filter.

    I can't find your plots on the ftp, are you sure they uploaded successfully? You can attach them to a forum comment too, btw, which may be simpler.

    Geraldine Van der Auwera, PhD

  • bredesonbredeson Posts: 20Member

    Attached is the --rscript_file and the tranches plots for GATK v2.5 and v2.6, respectively.

    Manihot_UG.rdx.Rplot.png
    667 x 694 - 110K
    Manihot_UG.rdx.tranches-v2.5.png
    766 x 484 - 40K
    Manihot_UG.rdx.tranches-v2.6.png
    764 x 437 - 38K
  • rpoplinrpoplin Posts: 121GATK Developer mod

    Hi there,

    The ti/tv values for the tranches are quite a bit different for the two versions which is surprising. Can you post the log output for the two runs of VariantRecalibrator? There is a lot of interesting info in there that will help us debug.

    Thanks for your help,

  • bredesonbredeson Posts: 20Member

    I, unfortunately, no longer have the v2.6 logging output. Perhaps the difference comes from specifying my stringently-filtered (unknown) dataset as known=true/false in the --resources argument? At the time I was fiddling with parameters (setting known vs not). The v2.5 plot above was produced with known=false, and the v2.6 plot was produced with known=true. When I re-ran v2.5 with known=true the tranches plot looked very similar to the v2.6 plot. The log from the v2.5 (with known=true) is provided below (excluding ProgressMeter reports for current position in genome):

    To clarify, my issue is not about the difference between the two plots provided above, but rather that my fractions of TPs/FPs (in both plots) are "flipped" with respect to the titv values, relative to all of the example plots I've seen in the docs -- I'm expecting to see 100% tranche-specific SNPs in the 90-tranche with decreasing fraction of tranche-specific SNPs, and increasing FPs, as tranche sensitivity is increased, but this is not what I'm seeing. I'm wondering if there is an error in the way the tranche plot was calculated or if I had run VQSR incorrectly...

    INFO  22:43:06,315 ArgumentTypeDescriptor - Dynamically determined type of /projectb/projectdirs/plant/analysis/cassava/genome/cassavaV4.1685010_168501.repeat.masked.srt.bed to be BED 
    INFO 22:43:06,382 HelpFormatter - --------------------------------------------------------------------------------
    INFO 22:43:06,383 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
    INFO 22:43:06,383 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 22:43:06,383 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 22:43:06,389 HelpFormatter - Program Args: -T VariantRecalibrator --reference_sequence /projectb/projectdirs/plant/analysis/cassava/genome/scaffolds+cpDNA.fasta --excludeIntervals /projectb/projectdirs/plant/analysis/cassava/genome/cassavaV4.1685010_168501.repeat.masked.srt.bed --mode SNP --dirichlet 0.01 --resource:bootstrap,known=true,training=true,truth=true,prior=8.0 bootstrap.red.HWE.flt.g60.recode.vcf --input Manihot_UG.rdx.vcf --recal_file Manihot_UG.rdx.DP.recal --tranches_file Manihot_UG.rdx.DP.tranches --rscript_file Manihot_UG.rdx.DP.Rplot -an BaseQRankSum -an HaplotypeScore -an QD -an DP -an ReadPosRankSum
    INFO 22:43:06,389 HelpFormatter - Date/Time: 2013/06/28 22:43:06
    INFO 22:43:06,389 HelpFormatter - --------------------------------------------------------------------------------
    INFO 22:43:06,389 HelpFormatter - --------------------------------------------------------------------------------
    INFO 22:43:06,448 ArgumentTypeDescriptor - Dynamically determined type of Manihot_UG.rdx.vcf to be VCF
    INFO 22:43:06,552 ArgumentTypeDescriptor - Dynamically determined type of bootstrap.red.HWE.flt.g60.recode.vcf to be VCF
    INFO 22:43:07,383 GenomeAnalysisEngine - Strictness is SILENT
    INFO 22:43:08,550 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 22:43:08,582 RMDTrackBuilder - Loading Tribble index from disk for file Manihot_UG.rdx.vcf
    INFO 22:43:09,560 RMDTrackBuilder - Loading Tribble index from disk for file bootstrap.red.HWE.flt.g60.recode.vcf
    INFO 22:43:10,655 IntervalUtils - Initial include intervals span 532668733 loci; exclude intervals span 7340626 loci
    INFO 22:43:10,658 IntervalUtils - Excluding 7340626 loci from original intervals (1.38% reduction)
    INFO 22:43:10,659 IntervalUtils - Processing 525328107 bp from intervals
    INFO 22:43:11,093 GenomeAnalysisEngine - Creating shard strategy for 0 BAM files
    INFO 22:43:11,129 GenomeAnalysisEngine - Done creating shard strategy
    INFO 22:43:11,130 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 22:43:11,130 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    INFO 22:43:11,141 TrainingSet - Found bootstrap track: Known = true Training = true Truth = true Prior = Q8.0
    INFO 23:08:02,647 VariantDataManager - BaseQRankSum: mean = 1.66 standard deviation = 1.84
    INFO 23:08:07,565 VariantDataManager - HaplotypeScore: mean = 2.77 standard deviation = 2.29
    INFO 23:08:12,461 VariantDataManager - QD: mean = 10.14 standard deviation = 6.06
    INFO 23:08:17,310 VariantDataManager - DP: mean = 627.20 standard deviation = 141.27
    INFO 23:08:22,139 VariantDataManager - ReadPosRankSum: mean = 0.46 standard deviation = 1.53
    INFO 23:08:30,001 VariantDataManager - Training with 3856585 variants after standard deviation thresholding.
    INFO 23:08:30,034 GaussianMixtureModel - Initializing model with 30 k-means iterations...
    INFO 23:13:01,766 VariantRecalibratorEngine - Finished iteration 0.
    INFO 23:18:42,510 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.27554
    INFO 23:23:58,014 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.12515
    INFO 23:29:14,003 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.08745
    INFO 23:34:30,042 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.07335
    INFO 23:40:13,629 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.06055
    INFO 23:46:03,272 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.05687
    INFO 23:51:48,713 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.06024
    INFO 23:57:35,193 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.05803
    INFO 00:03:24,026 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.05324
    INFO 00:09:08,881 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.04774
    INFO 00:14:54,715 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.04201
    INFO 00:20:44,183 VariantRecalibratorEngine - Finished iteration 60. Current change in mixture coefficients = 0.03552
    INFO 00:26:32,051 VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.02855
    INFO 00:32:22,767 VariantRecalibratorEngine - Finished iteration 70. Current change in mixture coefficients = 0.02202
    INFO 00:34:42,515 VariantRecalibratorEngine - Convergence after 72 iterations!
    INFO 00:35:28,152 VariantRecalibratorEngine - Evaluating full set of 17485011 variants...
    INFO 00:37:07,046 VariantDataManager - Found 0 variants overlapping bad sites training tracks.
    INFO 00:38:00,059 VariantDataManager - Additionally training with worst 3.000% of passing data --> 524550 variants with LOD <= -33.4560.
    INFO 00:38:00,059 GaussianMixtureModel - Initializing model with 30 k-means iterations...
    INFO 00:39:18,063 VariantRecalibratorEngine - Finished iteration 0.
    INFO 00:40:18,395 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.52601
    INFO 00:41:18,504 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.13586
    INFO 00:42:19,106 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.08304
    INFO 00:43:19,823 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.06336
    INFO 00:44:20,936 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.04779
    INFO 00:45:21,891 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.03722
    INFO 00:46:22,500 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.02795
    INFO 00:47:23,425 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.02178
    INFO 00:48:24,703 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.02168
    INFO 00:49:25,391 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.02133
    INFO 00:50:26,342 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.02017
    INFO 00:50:38,529 VariantRecalibratorEngine - Convergence after 56 iterations!
    INFO 00:50:47,249 VariantRecalibratorEngine - Evaluating full set of 17485011 variants...
    INFO 01:09:34,114 TrancheManager - Finding 4 tranches for 17485011 variants
    INFO 01:10:35,852 TrancheManager - Tranche threshold 100.00 => selection metric threshold 0.000
    INFO 01:10:44,787 TrancheManager - Found tranche for 100.000: 0.000 threshold starting with variant 0; running score is 0.000
    INFO 01:10:44,788 TrancheManager - Tranche is Tranche ts=100.00 minVQSLod=-687620.4316 known=(3856587 @ 1.6917) novel=(13628424 @ 2.7844) truthSites(3856587 accessible, 3856587 called), name=anonymous]
    INFO 01:10:44,789 TrancheManager - Tranche threshold 99.90 => selection metric threshold 0.001
    INFO 01:10:53,313 TrancheManager - Found tranche for 99.900: 0.001 threshold starting with variant 6462348; running score is 0.001
    INFO 01:10:53,314 TrancheManager - Tranche is Tranche ts=99.90 minVQSLod=6.0801 known=(3852730 @ 1.6912) novel=(7169933 @ 2.4316) truthSites(3856587 accessible, 3852730 called), name=anonymous]
    INFO 01:10:53,314 TrancheManager - Tranche threshold 99.00 => selection metric threshold 0.010
    INFO 01:11:01,750 TrancheManager - Found tranche for 99.000: 0.010 threshold starting with variant 7636391; running score is 0.010
    INFO 01:11:01,750 TrancheManager - Tranche is Tranche ts=99.00 minVQSLod=8.4217 known=(3818021 @ 1.6880) novel=(6030599 @ 2.3723) truthSites(3856587 accessible, 3818021 called), name=anonymous]
    INFO 01:11:01,751 TrancheManager - Tranche threshold 90.00 => selection metric threshold 0.100
    INFO 01:11:10,087 TrancheManager - Found tranche for 90.000: 0.100 threshold starting with variant 9530520; running score is 0.100
    INFO 01:11:10,088 TrancheManager - Tranche is Tranche ts=90.00 minVQSLod=10.6484 known=(3470928 @ 1.6673) novel=(4483563 @ 2.3165) truthSites(3856587 accessible, 3470928 called), name=anonymous]
    INFO 01:11:10,111 VariantRecalibrator - Writing out recalibration table...
    INFO 01:15:42,042 VariantRecalibrator - Writing out visualization Rscript file...
    INFO 01:15:42,146 VariantRecalibrator - Building BaseQRankSum x HaplotypeScore plot...
    INFO 01:15:42,178 VariantRecalibratorEngine - Evaluating full set of 21267 variants...
    INFO 01:15:45,775 VariantRecalibratorEngine - Evaluating full set of 21267 variants...
    INFO 01:15:50,071 VariantRecalibrator - Building BaseQRankSum x QD plot...
    INFO 01:15:50,076 VariantRecalibratorEngine - Evaluating full set of 8062 variants...
    INFO 01:15:51,466 VariantRecalibratorEngine - Evaluating full set of 8062 variants...
    INFO 01:15:53,063 VariantRecalibrator - Building BaseQRankSum x DP plot...
    INFO 01:15:53,066 VariantRecalibratorEngine - Evaluating full set of 25298 variants...
    INFO 01:15:57,431 VariantRecalibratorEngine - Evaluating full set of 25298 variants...
    INFO 01:16:02,356 VariantRecalibrator - Building BaseQRankSum x ReadPosRankSum plot...
    INFO 01:16:02,360 VariantRecalibratorEngine - Evaluating full set of 36974 variants...
    INFO 01:16:08,772 VariantRecalibratorEngine - Evaluating full set of 36974 variants...
    INFO 01:16:15,910 VariantRecalibrator - Building HaplotypeScore x QD plot...
    INFO 01:16:15,912 VariantRecalibratorEngine - Evaluating full set of 8874 variants...
    INFO 01:16:17,413 VariantRecalibratorEngine - Evaluating full set of 8874 variants...
    INFO 01:16:19,169 VariantRecalibrator - Building HaplotypeScore x DP plot...
    INFO 01:16:19,173 VariantRecalibratorEngine - Evaluating full set of 27846 variants...
    INFO 01:16:23,870 VariantRecalibratorEngine - Evaluating full set of 27846 variants...
    INFO 01:16:29,363 VariantRecalibrator - Building HaplotypeScore x ReadPosRankSum plot...
    INFO 01:16:29,369 VariantRecalibratorEngine - Evaluating full set of 40698 variants...
    INFO 01:16:36,254 VariantRecalibratorEngine - Evaluating full set of 40698 variants...
    INFO 01:16:44,155 VariantRecalibrator - Building QD x DP plot...
    INFO 01:16:44,158 VariantRecalibratorEngine - Evaluating full set of 10556 variants...
    INFO 01:16:45,982 VariantRecalibratorEngine - Evaluating full set of 10556 variants...
    INFO 01:16:48,058 VariantRecalibrator - Building QD x ReadPosRankSum plot...
    INFO 01:16:48,061 VariantRecalibratorEngine - Evaluating full set of 15428 variants...
    INFO 01:16:50,724 VariantRecalibratorEngine - Evaluating full set of 15428 variants...
    INFO 01:16:53,702 VariantRecalibrator - Building DP x ReadPosRankSum plot...
    INFO 01:16:53,707 VariantRecalibratorEngine - Evaluating full set of 48412 variants...
    INFO 01:17:02,067 VariantRecalibratorEngine - Evaluating full set of 48412 variants...
    INFO 01:17:11,355 VariantRecalibrator - Executing: Rscript /global/projectb/scratch/bredeson/Manihot_esculenta/WGS_HQ/UnifiedGenotyper/Manihot_UG.rdx.DP.Rplot
    INFO 01:19:24,826 VariantRecalibrator - Executing: Rscript (resource)org/broadinstitute/sting/gatk/walkers/variantrecalibration/plot_Tranches.R /global/projectb/scratch/bredeson/Manihot_esculenta/WGS_HQ/UnifiedGenotyper/Manihot_UG.rdx.DP.tranches 2.15
    INFO 01:19:39,354 ProgressMeter - done 2.96e+07 2.6 h 5.3 m 100.0% 2.6 h 0.0 s
    INFO 01:19:39,354 ProgressMeter - Total runtime 9388.22 secs, 156.47 min, 2.61 hours
    INFO 01:19:56,862 GATKRunReport - Uploaded run statistics report to AWS S3
  • mikemike Posts: 103Member

    Hi,

    I used to be able to run VQSR step for my old data, however, when I started the new version 2.6-4, I got the following error message from my log file:

    .....
    .....
    INFO 18:38:46,252 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.08907
    INFO 18:38:46,307 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.02334
    INFO 18:38:46,363 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.03583
    INFO 18:38:46,407 VariantRecalibratorEngine - Convergence after 54 iterations!
    INFO 18:38:46,414 VariantRecalibratorEngine - Evaluating full set of 180431 variants...
    INFO 18:38:47,634 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Unable to retrieve result
    at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
    Caused by: java.lang.IllegalArgumentException: log10p: Values must be non-infinite and non-NAN
    at org.broadinstitute.sting.utils.MathUtils.log10sumLog10(MathUtils.java:236)
    at org.broadinstitute.sting.utils.MathUtils.log10sumLog10(MathUtils.java:224)
    at org.broadinstitute.sting.utils.MathUtils.log10sumLog10(MathUtils.java:249)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.GaussianMixtureModel.evaluateDatum(GaussianMixtureModel.java:238)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.evaluateDatum(VariantRecalibratorEngine.java:167)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.evaluateData(VariantRecalibratorEngine.java:100)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:335)
    at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:132)
    at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226)
    at org.broadinstitute.sting.gatk.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183)
    ... 5 more

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 2.6-4-g3e5ff60):
    ERROR
    ERROR Please check the documentation guide to see if this is a known problem
    ERROR If not, please post the error, with stack trace, to the GATK forum
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Unable to retrieve result
    ERROR ------------------------------------

    Here is my command:

    java -Xmx4g -jar /path/gatk/2.6-4/bin/GenomeAnalysisTK.jar -T VariantRecalibrator -R /Path/hg19.fa -nt 4
    -input /Path/GATK_UG_SelSNP.vcf
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /Path/bundle-2.5/hg19/hapmap_3.3.hg19.vcf
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /Pathb/bundle-2.5/hg19/1000G_omni2.5.hg19.vcf
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /Path/bundle-2.5/hg19/dbsnp_137.hg19.vcf
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /Path/bundle-2.5/hg19/1000G_phase1.snps.high_confidence.hg19.vcf
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP -an InbreedingCoeff

    -percentBad 0.01 -minNumBad 1000

    -recalFile /Path/GATK_UG_SelSNP.VarRecal
    -tranchesFile /Path/GATK_UG_SelSNP.tranches
    -rscriptFile /Path/GATK_UG_SelSNP.R
    -mode SNP

    What is even odd is: when I run similar command for indels, it went through without any error, here is the indel command:

    java -Xmx4g -jar /Path/gatk/2.6-4/bin/GenomeAnalysisTK.jar -T VariantRecalibrator -R /Path//hg19.fa
    -input /Path/GATK_UG_SelIndel.vcf
    -resource:mills,known=false,training=true,truth=true,prior=12.0 /Path/bundle-2.5/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /Path/bundle-2.5/hg19/dbsnp_137.hg19.vcf
    -percentBad 0.01 -minNumBad 1000
    -an DP -an FS -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff
    --maxGaussians 4
    -recalFile /Path/GATK_UG_SelIndel.VarRecal
    -tranchesFile /Path/GATK_UG_SelIndel.tranches
    -rscriptFile /Path/GATK_UG_SelIndel.R
    -mode INDEL

    Any idea why I got such error in SNP VQSR but not in Indel VQSR? In Indel, I used --maxGaussians 4. but in SNP, I used -nt 4 based on documnetation, other than the -resource files from bundle, these are only difference between the indel and SNP VQSR.

    Thanks for the help!

    Mike

  • bredesonbredeson Posts: 20Member
    edited July 2013

    I'm putting a tarball of the v2.5 and v2.6 VQSR plots, tranches, and run-log files on the FTP in the "incoming" folder called Manihot_VQSR.tar.gz

    Post edited by bredeson on
  • bredesonbredeson Posts: 20Member
    edited July 2013

    @bredeson said:
    I'm putting a tarball of the v2.5 and v2.6 VQSR plots, tranches, and run-log files on the FTP in the "incoming" folder called Manihot_VQSR.tar.gz

    Scratch that, it's in the main public ftp root dir as Manihot_VQSR.tar.gz

    -rw-r--r-- 1 depristo wga 27491078 Jul 4 01:55 Manihot_VQSR.tar.gz

    A little explanation of the content naming in the tar ball:

    1) files are labeled with the version of GATK used to do VQSR,
    2) files are labeled with true or false indicating the state that the known attribute was set to at the --resource flag.

    Post edited by bredeson on
  • mikemike Posts: 103Member

    Hi:

    Just follow up on my last post, for SNP VQSR, I found out that the error was caused by the absence of --maxGaussians 4, as long as I added the option: --maxGaussians 4, with or without -nt 4, they all went through well. Not sure what default value for --maxGaussians. With -h , I can not see it. any way, any idea why it behaved like this?

    Thanks for the help!

    Mike

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,840Administrator, GATK Developer admin

    @bredeson, thanks for the files. Due to the holiday we may not be able to get to this for a couple of days though. Thanks for your patience.

    Geraldine Van der Auwera, PhD

  • rpoplinrpoplin Posts: 121GATK Developer mod

    HI there,

    I think one source of the confusion with TP/FPs on the tranche plot is that in the 2.6 plot all of the tranches are over the specified target ti/tv value and so there is no way to estimate TP/FPs (everything is assumed to be a TP) but in the 2.5 plot you have ti/tv that go above and below the target ti/tv value. Is this data from an exome sequencing project? If so you probably want to increases that target ti/tv and I think the tranche plots will make more sense. Take a look at the --target_titv argument in VariantRecalibrator.

    Let me know if that helps.

  • bredesonbredeson Posts: 20Member

    Hello rpoplin,
    It's been a while, but I'm revisiting this subject, as I do not yet fully understand how my truth=90 tranche can have cumulative [TF]Ps...

    I realigned around indels (which I thought I had done with the prior dataset) and then re-called (with UG) and re-ran VQSR with a --target_titv of 2.5 and the following plots were outputted. This result, as before, still contains tranches with cumulative [TF]Ps in the starting tranche, and it is still not clear to me how that is possible: is this an artifact of my training data (stringently filtered sites from the same dataset) or of the way I ran VariantRecalibrator (see log below)?

    Also, the Cumulative FPs bins overlap the Tranch-Specific TPs bins; this looks like a plotting error to me, but is this intended/meaningful? Again, my concern stems from the dissimilarity of my plot with the example plots here: http://www.broadinstitute.org/gatk/guide/article?id=39

    Thank you so much for your help!

    INFO 12:26:54,300 ArgumentTypeDescriptor - Dynamically determined type of /projectb/projectdirs/plant/analysis/cassava/genome/cassavaV4.1685010_168501.repeat.masked.srt.bed to be BED
    INFO 12:26:54,384 HelpFormatter - --------------------------------------------------------------------------------
    INFO 12:26:54,384 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
    INFO 12:26:54,385 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 12:26:54,385 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 12:26:54,390 HelpFormatter - Program Args: -T VariantRecalibrator --reference_sequence /projectb/projectdirs/plant/analysis/cassava/genome/scaffolds+cpDNA.fasta --excludeIntervals /projectb/projectdirs/plant/analysis/cassava/genome/cassavaV4.1685010_168501.repeat.masked.srt.bed --mode SNP --resource:bootstrap,known=false,training=true,truth=true,prior=8.0 Manihot_UG.VF2+HW.cDP.vcf --input Manihot_UG.vcf --recal_file Manihot_UG.titv2_5.recal --tranches_file Manihot_UG.titv2_5.tranches --rscript_file Manihot_UG.titv2_5.Rplot -an BaseQRankSum -an FS -an InbreedingCoeff -an HaplotypeScore -an QD -an DP -an MQRankSum -an ReadPosRankSum --target_titv 2.5
    INFO 12:26:54,391 HelpFormatter - Date/Time: 2013/08/06 12:26:54
    INFO 12:26:54,391 HelpFormatter - --------------------------------------------------------------------------------
    INFO 12:26:54,391 HelpFormatter - --------------------------------------------------------------------------------
    INFO 12:26:54,456 ArgumentTypeDescriptor - Dynamically determined type of Manihot_UG.vcf to be VCF
    INFO 12:26:54,506 ArgumentTypeDescriptor - Dynamically determined type of Manihot_UG.VF2+HW.cDP.vcf to be VCF
    INFO 12:26:55,427 GenomeAnalysisEngine - Strictness is SILENT
    INFO 12:26:56,636 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 12:26:56,678 RMDTrackBuilder - Loading Tribble index from disk for file Manihot_UG.vcf
    INFO 12:26:57,559 RMDTrackBuilder - Loading Tribble index from disk for file Manihot_UG.VF2+HW.cDP.vcf
    INFO 12:26:58,768 IntervalUtils - Initial include intervals span 532668733 loci; exclude intervals span 7340626 loci
    INFO 12:26:58,772 IntervalUtils - Excluding 7340626 loci from original intervals (1.38% reduction)
    INFO 12:26:58,773 IntervalUtils - Processing 525328107 bp from intervals
    INFO 12:26:59,279 GenomeAnalysisEngine - Creating shard strategy for 0 BAM files
    INFO 12:26:59,317 GenomeAnalysisEngine - Done creating shard strategy
    INFO 12:26:59,318 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 12:26:59,318 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    INFO 12:26:59,326 TrainingSet - Found bootstrap track: Known = false Training = true Truth = true Prior = Q8.0
    ...
    INFO 13:08:36,396 ProgressMeter - scaffold12914:16853 3.04e+07 41.6 m 82.0 s 99.7% 41.8 m 8.0 s
    INFO 13:08:47,619 VariantDataManager - BaseQRankSum: mean = 1.09 standard deviation = 1.86
    INFO 13:08:53,561 VariantDataManager - FS: mean = 7.53 standard deviation = 9.53
    INFO 13:08:59,407 VariantDataManager - InbreedingCoeff: mean = 0.17 standard deviation = 0.36
    INFO 13:09:05,407 VariantDataManager - HaplotypeScore: mean = 6.81 standard deviation = 6.65
    INFO 13:09:06,399 ProgressMeter - scaffold12977:1750 3.05e+07 42.1 m 82.0 s 100.0% 42.1 m 0.0 s
    INFO 13:09:11,277 VariantDataManager - QD: mean = 7.96 standard deviation = 5.89
    INFO 13:09:17,085 VariantDataManager - DP: mean = 567.93 standard deviation = 267.48
    INFO 13:09:22,875 VariantDataManager - MQRankSum: mean = 0.00 standard deviation = 2.74
    INFO 13:09:28,635 VariantDataManager - ReadPosRankSum: mean = 0.73 standard deviation = 1.75
    INFO 13:09:36,401 ProgressMeter - scaffold12977:1750 3.05e+07 42.6 m 83.0 s 100.0% 42.6 m 0.0 s
    INFO 13:09:37,230 VariantDataManager - Training with 13035063 variants after standard deviation thresholding.
    INFO 13:09:37,291 GaussianMixtureModel - Initializing model with 30 k-means iterations...
    ...
    INFO 18:07:42,225 VariantRecalibratorEngine - Evaluating full set of 17522718 variants...
    INFO 18:08:04,181 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.2 m 100.0% 5.7 h 0.0 s
    INFO 18:08:34,183 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.2 m 100.0% 5.7 h 0.0 s
    INFO 18:09:04,185 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.2 m 100.0% 5.7 h 0.0 s
    INFO 18:09:34,187 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.2 m 100.0% 5.7 h 0.0 s
    INFO 18:10:04,190 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.2 m 100.0% 5.7 h 0.0 s
    INFO 18:10:34,192 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.3 m 100.0% 5.7 h 0.0 s
    INFO 18:10:37,827 VariantDataManager - Found 0 variants overlapping bad sites training tracks.
    INFO 18:11:04,194 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.3 m 100.0% 5.7 h 0.0 s
    INFO 18:11:26,512 VariantDataManager - Additionally training with worst 3.000% of passing data --> 525682 variants with LOD <= -14.8842.
    INFO 18:11:26,513 GaussianMixtureModel - Initializing model with 30 k-means iterations...
    INFO 18:11:34,196 ProgressMeter - scaffold12977:1750 3.05e+07 5.7 h 11.3 m 100.0% 5.7 h 0.0 s
    INFO 18:12:04,199 ProgressMeter - scaffold12977:1750 3.05e+07 5.8 h 11.3 m 100.0% 5.8 h 0.0 s
    INFO 18:12:34,201 ProgressMeter - scaffold12977:1750 3.05e+07 5.8 h 11.3 m 100.0% 5.8 h 0.0 s
    INFO 18:12:35,512 VariantRecalibratorEngine - Finished iteration 0.
    INFO 18:13:04,203 ProgressMeter - scaffold12977:1750 3.05e+07 5.8 h 11.3 m 100.0% 5.8 h 0.0 s
    INFO 18:13:34,205 ProgressMeter - scaffold12977:1750 3.05e+07 5.8 h 11.4 m 100.0% 5.8 h 0.0 s
    ...
    INFO 18:22:10,281 VariantRecalibratorEngine - Convergence after 36 iterations!
    ...
    INFO 19:06:35,704 VariantRecalibrator - Executing: Rscript /global/projectb/scratch/bredeson/Manihot_esculenta/WGS_HQ/UnifiedGenotyper_Realigned/Manihot_UG.titv2_5.Rplot
    ...
    INFO 19:11:17,745 VariantRecalibrator - Executing: Rscript (resource)org/broadinstitute/sting/gatk/walkers/variantrecalibration/plot_Tranches.R /global/projectb/scratch/bredeson/Manihot_esculenta/WGS_HQ/UnifiedGenotyper_Realigned/Manihot_UG.titv2_5.tranches 2.5
    ...
    INFO 19:11:32,736 ProgressMeter - done 3.05e+07 6.7 h 13.3 m 100.0% 6.7 h 0.0 s
    INFO 19:11:32,736 ProgressMeter - Total runtime 24273.42 secs, 404.56 min, 6.74 hours
    INFO 19:12:07,556 GATKRunReport - Uploaded run statistics report to AWS S3

    Screen Shot 2013-08-26 at 6.24.04 PM.png
    745 x 453 - 43K
    Screen Shot 2013-08-26 at 6.36.55 PM.png
    644 x 662 - 103K
  • bredesonbredeson Posts: 20Member
    edited August 2013

    @rpoplin: To answer your previous question, this is WGS data.

    Post edited by bredeson on
Sign In or Register to comment.