VQSR filtered true rare variants

I use GATK to call variants and run variant quality score recalibration (VQSR). I usually have several samples from patients with different diseases, and I am looking for the rare mutations responsible for each disease. To make use of VQSR, I add extra samples so that there are more than 30 samples in the joint calling. These additional samples are exomes sequenced in house, matched to the disease samples by population and sequencing platform. I thought the strategy worked quite well until I found that a disease-causing variant failed the VQSR filter. The VCF line of the variant is shown below (position is masked):
chr19 XXXXXXXX . C T 3130.29 VQSRTrancheSNP99.00to99.90 AC=2;AF=0.016;AN=128;BaseQRankSum=-3.027;ClippingRankSum=-2.056;DP=3336;FS=7.171;InbreedingCoeff=0.9834;MLEAC=2;MLEAF=0.016;MQ=59.62;MQ0=0;MQRankSum=0.865;NEGATIVE_TRAIN_SITE;QD=32.61;ReadPosRankSum=1.928;VQSLOD=-3.835e+00;culprit=InbreedingCoeff GT:AD:GQ:PL 0/0:31,0:99:0,111,1083 0/0:7,0:39:0,39,445 0/0:28,0:83:0,83,755 0/0:56,0:99:0,168,1983 0/0:78,0:99:0,288,4190 0/0:90,0:99:0,271,3267 0/0:85,0:99:0,254,3098 0/0:93,0:99:0,280,3435 0/0:91,0:99:0,271,3237 0/0:100,0:99:0,301,3764 0/0:92,0:99:0,277,3387 0/0:73,0:99:0,218,2595 0/0:57,0:99:0,171,2166 0/0:91,0:99:0,274,3398 0/0:89,0:99:0,331,5082 0/0:75,0:99:0,226,2798 0/0:77,0:99:0,232,2887 0/0:76,0:99:0,227,2861 0/0:95,0:99:0,286,3578 0/0:73,0:99:0,218,2622 1/1:0,96:99:3192,289,0 0/0:60,0:99:0,181,2172 0/0:60,0:99:0,181,2258 0/0:84,0:99:0,251,3033 0/0:62,0:99:0,187,2253 0/0:69,0:99:0,208,2551 0/0:71,0:99:0,214,2628 0/0:95,0:99:0,286,3474 0/0:76,0:99:0,227,2738 0/0:83,0:99:0,250,3070 0/0:78,0:99:0,233,2868 0/0:20,0:57:0,57,504 0/0:33,0:99:0,99,844 0/0:45,0:99:0,137,1218 0/0:40,0:99:0,119,1044 0/0:31,0:93:0,93,855 0/0:20,0:60:0,60,563 0/0:27,0:81:0,81,801 0/0:14,0:42:0,42,392 0/0:40,0:99:0,120,1219 0/0:42,0:99:0,125,1221 0/0:30,0:89:0,89,855 0/0:46,1:99:0,134,1320 0/0:65,0:99:0,193,1878 0/0:42,0:99:0,126,1307 0/0:30,0:90:0,90,906 0/0:23,0:69:0,69,729 0/0:31,0:92:0,92,910 0/0:23,0:69:0,69,721 0/0:12,0:36:0,36,343 0/0:9,0:27:0,27,300 ./. 0/0:27,0:81:0,81,805 0/0:5,0:15:0,15,146 0/0:24,0:72:0,72,711 0/0:27,0:81:0,81,813 0/0:27,0:81:0,81,813 0/0:31,0:92:0,92,935 0/0:14,0:42:0,42,409 0/0:42,0:99:0,126,1258 0/0:13,0:39:0,39,401 0/0:53,0:99:0,159,1575 0/0:37,0:99:0,106,982 0/0:54,0:99:0,162,1604 0/0:58,0:99:0,174,1743
As you can see, the variant is homozygous in one patient, and all the other patients carry only the reference allele. As 'culprit=InbreedingCoeff' indicates, the variant failed the filter because of the inbreeding coefficient alone. Quoting @Geraldine_VdAuwera: "This is a measure of the level of inbreeding of a group of samples." I assume the failure means VQSR thinks the variant is too rare to be true. Am I right about that? I don't see any other problem with the variant: read depth, base quality and mapping quality all look OK.
Should I use InbreedingCoeff for VQSR if the variants are called from samples from a number of different families with different diseases, and the aim is to find rare mutations?
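For readers who want to check a record like this themselves, here is a minimal Python sketch that pulls out the FILTER value, the INFO annotations and a tally of genotypes. The function name `summarize_vcf_line` is made up for illustration, and the record is a shortened copy of the line above (INFO trimmed, only four sample columns kept). It splits on any whitespace because the line is pasted here with spaces; a real VCF is tab-delimited.

```python
from collections import Counter

def summarize_vcf_line(line: str) -> dict:
    """Split one VCF data line and summarize FILTER, INFO and genotype counts."""
    fields = line.split()
    filter_status, info_text = fields[6], fields[7]
    info = {}
    for entry in info_text.split(";"):
        key, sep, value = entry.partition("=")
        # Flag annotations such as NEGATIVE_TRAIN_SITE carry no value.
        info[key] = value if sep else True
    # Sample columns start at index 9; the genotype is the first ':'-separated subfield.
    genotypes = Counter(sample.split(":")[0] for sample in fields[9:])
    return {"filter": filter_status, "info": info, "genotypes": genotypes}

# Shortened copy of the record above (hypothetical subset of its columns).
record = (
    "chr19 XXXXXXXX . C T 3130.29 VQSRTrancheSNP99.00to99.90 "
    "AC=2;AN=128;InbreedingCoeff=0.9834;NEGATIVE_TRAIN_SITE;"
    "VQSLOD=-3.835e+00;culprit=InbreedingCoeff "
    "GT:AD:GQ:PL 0/0:31,0:99:0,111,1083 1/1:0,96:99:3192,289,0 ./. 0/0:7,0:39:0,39,445"
)
summary = summarize_vcf_line(record)
print(summary["filter"], summary["info"]["culprit"], dict(summary["genotypes"]))
```

Run on the full line, the same tally should show a single 1/1 call among the called samples, which matches the description above.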
Thanks
Best Answer
Geraldine_VdAuwera Cambridge, MA admin
Hi all,
Sorry for the delay -- after discussion with the devs I can confirm that the VQSR arguments doc is actually completely up to date. InbreedingCoeff is indeed back (for cohort sizes >10). MQ is applicable for SNPs but not for indels (see explanation in the article -- the problem with MQ is specific to indels).
These recommendations are in step with the latest changes to the HaplotypeCaller and the VQSR code in the 3.x version. Some were previously removed because of artifactual effects or adverse interactions, and were restored once the problems were corrected for.
In future we're going to set up a system to make sure that when the recommendations are updated, a notice will be posted explaining the rationale behind each change (like software release notes but for the parameters).
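As a rough illustration of the two rules stated in this answer (InbreedingCoeff only for cohorts of more than 10 samples, MQ for SNPs but not for indels), the following Python sketch builds the `-an` arguments for VariantRecalibrator. The function name `vqsr_annotation_args` is hypothetical, and the base annotation list in the usage line is just a handful of annotations visible in the record above, not the list from the recommendations article; please check the article for the current set.

```python
from typing import List

def vqsr_annotation_args(mode: str, n_samples: int, base_annotations: List[str]) -> List[str]:
    """Build '-an' arguments using the two rules from the answer above."""
    if mode not in ("SNP", "INDEL"):
        raise ValueError("mode must be 'SNP' or 'INDEL'")
    # Start from the caller's list, minus the two annotations governed by the rules.
    annotations = [a for a in base_annotations if a not in ("MQ", "InbreedingCoeff")]
    if mode == "SNP":
        annotations.append("MQ")  # MQ applies to SNPs but not to indels
    if n_samples > 10:
        annotations.append("InbreedingCoeff")  # needs a cohort of more than 10 samples
    args: List[str] = []
    for name in annotations:
        args += ["-an", name]
    return args

# Assumed base list for illustration only.
print(" ".join(vqsr_annotation_args("SNP", 64, ["QD", "FS", "MQRankSum", "ReadPosRankSum"])))
```

Anyone who would rather leave InbreedingCoeff out for a rare-variant study, as discussed further down in this thread, can simply drop that branch.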
Answers
"measure of inbreeding" is the correct description of the metric, but I like to think of it as detecting abnormal distributions of genotypes. All homozygous will be about 1, all heterozygous will be about -1, and in HWE will be 0. I encountered the exact scenario you describe once upon a time, and so I no longer use InbreedingCoeff in VQSR.
I do have a secondary hard filtering step, however, because the all-het case is pretty likely to be an error (I think I filter InbreedingCoeff < -0.8)
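To make the three reference values concrete, here is a minimal sketch of the textbook estimate F = 1 - (observed hets / expected hets), computed from hard genotype counts, together with the -0.8 cutoff mentioned above. This is a simplification: as far as I understand, GATK's own InbreedingCoeff annotation is computed from genotype likelihoods rather than hard calls, which would explain why the record in the question reports 0.9834 and not exactly 1. The function names are made up for illustration.

```python
def inbreeding_coefficient(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """F = 1 - observed_hets / expected_hets, with expected hets = 2pqN under HWE."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no called genotypes")
    q = (n_het + 2 * n_hom_alt) / (2 * n)  # alternate allele frequency
    p = 1.0 - q
    expected_hets = 2 * p * q * n
    if expected_hets == 0:
        raise ValueError("site is monomorphic; F is undefined")
    return 1.0 - n_het / expected_hets

def fails_excess_het_filter(f: float, threshold: float = -0.8) -> bool:
    """Secondary hard filter: flag sites whose F falls below the threshold."""
    return f < threshold

# Counts at the site in the question: 63 hom-ref, 0 het, 1 hom-alt (AN=128).
print(inbreeding_coefficient(63, 0, 1))    # 1.0
print(inbreeding_coefficient(0, 64, 0))    # -1.0 (all heterozygous)
print(inbreeding_coefficient(16, 32, 16))  # 0.0 (Hardy-Weinberg proportions)
print(fails_excess_het_filter(inbreeding_coefficient(0, 64, 0)))  # True
```

With a single homozygous carrier and no heterozygotes, F sits at the top of its range, so a model trained mostly on common variants near 0 can easily treat the site as an outlier.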
We no longer include InbreedingCoeff in our Best Practices recommendations; I can't tell you if this kind of case is the reason why, but I wouldn't be surprised if it was.
InbreedingCoeff is mentioned twice:
http://gatk.vanillaforums.com/discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project
"For example, InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included."
"InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included."
Should the above be removed?
Yes, actually -- thanks for pointing them out. I've removed them.
Hi, I saw that the recommendation to use InbreedingCoeff is back in the VQSR Best Practices. Does that mean the problem mentioned here won't happen with the newer version of GATK? It would certainly be great if using the parameter excluded false positives without excluding true rare variants. Thanks.
@byb121,
No, they should still be out if I recall correctly -- can you please indicate which document you saw that in?
http://www.broadinstitute.org/gatk/guide/article?id=1259
I guess this is the latest one, because it was updated yesterday. By the way, the search box at the top right corner seems to be out of order; it cannot find anything on this site.
That is indeed the right document, but that recommendation is not correct. I'll fix that asap.
The search box is currently broken, unfortunately; I'm hoping to have that fixed later this week. Try using Google with http://gatkforums.broadinstitute.org/ as the search target. Sorry for the inconvenience.
@byb121 FYI, the search box is fixed.
Please check that document again (article?id=1259). I see InbreedingCoeff for both modes; I also see MQ for mode SNP, and QD for mode INDEL. Are some of these incorrect recommendations?
Hi @Oprah,
You're right -- let me check with the team to make absolutely sure what the latest recommendation should be, and I will update this later this afternoon. My apologies for the confusion.
Hi @Oprah,
I might have missed something, but why is MQ not supposed to be used in SNP mode? I thought it was a must.
Thanks,
Maybe MQRankSum makes MQ redundant?