It looks like you're new here. If you want to get involved, click one of these buttons!
Hi,
Recently I run into some odd observation in VQSR. I have 17 samples from a same family and I used all of 17 samples to call SNPs and after VQSR, I got the trench file like this:
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,48637,716,2.9527,2.3302,4.8390,VQSRTrancheSNP0.00to90.00,SNP,26182,23563,0.9000 99.00,60114,1531,2.8057,2.3333,1.7766,VQSRTrancheSNP90.00to99.00,SNP,26182,25920,0.9900 99.90,67220,2884,2.7190,1.8222,-10.0009,VQSRTrancheSNP99.00to99.90,SNP,26182,26155,0.9990 100.00,69714,4998,2.6822,1.8300,-1122.0698,VQSRTrancheSNP99.90to100.00,SNP,26182,26182,1.0000
which seems fine. then for research purpose, I only used 5 samples of more tight relation such as two parents and their 3 immediate children and after VQSR, the trench file looks like below:
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP0.00to90.00,SNP,20850,20850,1.0000 99.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP90.00to99.00,SNP,20850,20850,1.0000 99.90,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.00to99.90,SNP,20850,20850,1.0000 100.00,50598,2279,2.6625,1.7993,-Infinity,VQSRTrancheSNP99.90to100.00,SNP,20850,20850,1.0000
Notice that the 5-sample VQSR tranch file has exactly the same thing throughout all thresholds: 90, 99, 99.90 and 100. and the VQSR modeling plot is also very odd, no plotting at all being seen (the pdf ifle was created but was almost blank in contrast to the normal projection plots I saw in other cases)
However, we did use the old version to call the same 5 samples before, and the trench file looks like below:
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,36407,361,2.8657,2.3119,5.0854,TruthSensitivityTranche0.00to90.00,20814,18732,0.9000 99.00,44097,638,2.7655,2.2222,2.2592,TruthSensitivityTranche90.00to99.00,20814,20605,0.9900 99.90,47947,1061,2.7078,1.8750,-7.4143,TruthSensitivityTranche99.00to99.90,20814,20793,0.9990 100.00,50426,2318,2.6645,1.7677,-647.3944,TruthSensitivityTranche99.90to100.00,20814,20814,1.0000
this time, it looks reasonable to me. This is troubling us since for 5 samples, the old version (V1.6-7) seems working fine, whereas the new version (V2.1-13) seems having issue or can not get further filtering by VQSR (90, 99 and 100 got the same result, I did repeat multiple times and got the same results), although for all of the 17 samples, the new version seems fine on VQSR.
So my questions are: 1. is it possible that in some occasion, VQSR can simply not work? 2. Why the old version seems working but not the new version for exactly the same set of 5-sample data?
Thanks a lot for your help!
Mike
Answers
Hi,
After carefully checking, I found more info although still puzzled:
I compared the old and new version commands side by side and realized a major difference in Unified genotyper command, in which the old version use -dcov 50 as default option suggested at the time (I do have the old documnetation for old version saved as in pdf file) and now in new version (V2) it changed to -dcov 200 (-dcov [50 for 4x, 200 for >30x WGS or Whole exome], see URL:http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_genotyper_UnifiedGenotyper.html) and so for V2, I did use -dcov 200, which caused the odd result as mentioned above, More surprisingly, when I re-run the v2 GATK using UG with -dcov 50 like in old version for the same dataset (just 5 samples), I got similar result as the old version result. the tranch file looks like below, which seems much reasonable in terms of TiTv ratios improvement from 99.9 to 99 to 90 etc:
Variant quality score tranches file
Version number 5
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity 90.00,35484,316,2.9264,2.1600,4.6455,VQSRTrancheSNP0.00to90.00,SNP,20849,18764,0.9000 99.00,44135,624,2.7757,2.1515,1.9238,VQSRTrancheSNP90.00to99.00,SNP,20849,20640,0.9900 99.90,47937,1007,2.7121,1.8523,-7.7956,VQSRTrancheSNP99.00to99.90,SNP,20849,20828,0.9990 100.00,50469,2133,2.6676,1.7890,-777.6778,VQSRTrancheSNP99.90to100.00,SNP,20849,20849,1.000
Also the VQSR modeling plot looks also normal as I usually saw, whereas the one using UG -dcov 200 give a very odd plot (only one page and also almost no plot at all for the whole pdf file for the model plot)
So my question is: why the option of dcov in UG step (although I used what was suggested by both old and new version documentations) has such big impact on the VQSR later disregard of whether it is old or new version I am running?
Thanks a lot for any insights!
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
There seems to be a lot of details here. Can you please post the full command lines that you are using to help us understand what you are seeing.
Cheers,
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, Ryan:
Thanks for looking into this.
I just posted the UG steps that I found may be the major causes, let me know if you need other info:
java -jar /opt/nasapps/stow/GenomeAnalysisTK-2.2-4-g4a174fb/bin/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /path/hg19.fa
-I /path/S1.bam -I /path/S2.bam -I /path/S3.bam -I /path/S4.bam -I /path/S5.bam --dbsnp /Path/dbsnp_135.hg19.vcf -glm BOTH -o /Path/GATK_UG_5Samples_snps_indel.raw.afterRecal_2012_V2.vcf -stand_call_conf 50 -stand_emit_conf 10 -dcov 200 -L /Path/029720_D_BED_20101013.bed
After this, I used SelectVariants to get SNPs only vcf file, which was subjected to VQSR step, then I got the odd VQSR tranches file as shown above
However, for the same set of 5 samples, for UG step like above command, I just use -dcov 50 instead of -dcov 200, then I got the reasonable tranches file shown above (in my 2nd post).
Is it odd? Did I do something wrong here?
Thanks
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
I see that you are using an intervals list:
-L /Path/029720_D_BED_20101013.bedIs this by any chance a targeted exome or smaller project? We don't recommend using the VQSR on such projects with small numbers of samples. My guess is that the fluctuations you are seeing don't have anything to do with the downsampling in the UG but are rather a sign of numerical instability in the VQSR with small numbers of variants. Take a look at our best practice recommendations for advice on how you might be able to use the VQSR with your project.I hope that helps,
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, Ryan:
Thanks for the input!
yes, it is a small project but as whole exome seq (the bed file is the whole exome target regions from agilent) with just a family of individuals with some of them with a type of cancer suspected to be inheritance-related or genetic cause. I double check the documentation of VQSR, which emphasize on the number of variant sites in the callsets at range of thousands of variants (except for using InbreedingCoeff needs at least 10 samples), and my dataset does have about 50k variants after UG for only 5 samples ( a subset of the family with Trio relations) and 70 k for all members of the big family (17), so I though it is reasonable to use VQSR, does it? I did on just 5 of them is because these 5 samples are with trio relation, we just want to see any improvement of the calls. I did do the variant calls and VQSR on all 17 samples of the whole family as well. which looks fine. So even if I have 50k variants in the 5-sample callset, the number of samples as 5 still not good enough?
I just wonder why even for 5 sample callset, after using -docv 50 on UG step, and it seems now that VQSR has the separation for 0.90 and 0.99 thresholds as mentioned above, compared to if use -dcov 200?
Any insight?
Thanks again
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, Ryan:
I did try VQSR with --maxGaussians 4 (used to be --maxGaussians 6), now I got tranches file as below, which looks fine with me now:
did not realize that --maxGaussians has such great impact on this. What do you think?
Thanks and best
Mike
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •