We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
QD in Join VCF vs its components in GVCFs does not match

Hello Forum Members and GATK Team,
I am trying to understand the calculations that goes into QD in GATK.
In the following its mentioned that the new versions calculate QD using AD:
https://www.broadinstitute.org/gatk/blog?id=3862
In joint VCF file I have following entry:
1 5474238 . C A 163.45 PASS AC=2;AF=0.071;AN=28;DP=609;FS=0.000;MQ=60.00;MQ0=0;QD=27.05;Samples=Sample1;VQSLOD=18.21;VariantType=SNP;culprit=MQ GT:AD:GQ:PL 1/1:0,3:12:175,12,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,00/0:0,0:0:0,0,0
In the individual GVCF files (all 14) I search for Chromosome 1 and Position 5474238, and there only one entry:
1 5474238 . C A, 142.28 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:12:175,12,0,175,12,175:0,0,2,1
Here the QUAL is 142.28 and AD is 3. How come in the joint VCF file, the QUAL becomes 163.45 and QD=27.05 with DP=609?
Above is simple example where only in one sample a GT was called that was different from other 0/0.
Lets take another example (bit complicated):
Joint VCF file entry:
1 24089531 . A G 882.89 PASS AC=13;AF=0.464;AN=28;BaseQRankSum=0.358;DP=159;FS=4.751;MQ=60.00;MQ0=0;MQRankSum=0.358;QD=10.77;ReadPosRankSum=0.00;Samples=Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7,Sample8,Sample9,Sample10;VQSLOD=13.36;VariantType=SNP;culprit=DP GT:AB:AD:GQ:PL 1/1:.:0,3:7:48,7,0 0/1:0.600:6,4:50:50,0,117 0/1:0.670:4,2:31:31,0,87 0/1:0.500:3,3:40:62,0,40 0/1:0.200:1,4:12:68,0,12 0/0:.:6,0:18:0,18,155 0/1:0.790:11,3:38:38,0,230 1/1:.:0,15:44:437,44,0 0/0:.:4,1:8:0,8,91 0/0:.:0,0:0:0,0,0 0/0:.:0,0:0:0,0,0 0/1:0.670:4,2:35:35,0,91 1/1:.:0,1:3:11,3,0 0/1:.:3,2:38:38,0,49
Entires in the GVCF files:
Sample1
1 24089531 . A G, 16.33 . DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:7:48,7,0,48,7,48:0,0,2,1
Sample2
1 24089531 . A G, 21.80 . BaseQRankSum=0.365;DP=12;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.741;ReadPosRankSum=-0.365 GT:AB:AD:GQ:PL:SB 0/1:0.600:6,4,0:50:50,0,117,68,128,196:6,0,3,1
Sample3
1 24089531 . A G, 38.03 . BaseQRankSum=-0.731;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AB:AD:GQ:PL:SB 0/1:0.200:1,4,0:12:68,0,12,71,24,95:1,0,3,1
Sample4
1 24089531 . A G, 0 . DP=7;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 0/0:6,0,0:18:0,18,155,18,155,155:6,0,0,0
Sample5
1 24089531 . A G, 4.61 . BaseQRankSum=-0.720;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.720;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:31:31,0,87,43,93,136:4,0,2,0
Sample6
1 24089531 . A G, 32.77 . BaseQRankSum=0.406;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.988;ReadPosRankSum=-0.406 GT:AB:AD:GQ:PL:SB 0/1:0.500:3,3,0:40:62,0,40,70,49,120:2,1,3,0
Sample7
1 24089531 . A G, 10.20 . BaseQRankSum=-0.156;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.778;ReadPosRankSum=-0.467 GT:AB:AD:GQ:PL:SB 0/1:0.790:11,3,0:38:38,0,230,70,239,309:11,0,3,0
Sample8
1 24089531 . A G, 10.20 . BaseQRankSum=-0.156;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.778;ReadPosRankSum=-0.467 GT:AB:AD:GQ:PL:SB 0/1:0.790:11,3,0:38:38,0,230,70,239,309:11,0,3,0
Sample9
1 24089531 . A G, 0 . BaseQRankSum=-0.731;DP=12;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AD:GQ:PL:SB 0/0:4,1,0:8:0,8,91,12,93,97:4,0,0,0
Sample14
Sample13
Sample10
1 24089531 . A G, 7.60 . BaseQRankSum=-1.380;DP=9;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.380;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:35:35,0,91,47,97,144:4,0,2,0
Sample11
1 24089531 . A G, 0.05 . DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,1,0:3:11,3,0,11,3,11:0,0,0,0
Sample12
1 24089531 . A G, 9.31 . BaseQRankSum=0.358;DP=7;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.358;ReadPosRankSum=-0.358 GT:AD:GQ:PL:SB 0/1:3,2,0:38:38,0,49,47,55,102:0,0,0,0
Using AD only, I am still not able to the QD value or the QUAL values in the Joint VCF files.
GATK Version=3.2-2-gec30cee
Thanks in Advance.
Best Answer
-
valentin Cambridge, MA ✭✭
Hi @sum732,
So as you noticed the QD is roughly speaking QUAL / Sum(Sum(AD)).
Notice however that samples with no AD are ignored and so are samples that have AD but all the alternative alleles counts are zero. In other words QD does not take in consideration samples that do not show evidence for a concrete alternative allele. If no sample has a compliant AD then effectively QUAL is probably 0 (or very close) and so QD (probably not even present).
However there are two important caveats here, two inaccuracy to the simple formula above.
A.)
First the Sum(Sum(AC)) is actually "normalized" using a weighted average length of the alternative alleles.
So a more accurate formula for QD is
QD = QUAL / (Sum(Sum(AD)) * alleleLengthNormalization(AC,ALT))
Where alleleLengthNormalization(AC,ALT) = Max(1.0,weightedAverageAltAlleleLength(AC,ALT) / 3.0)
Where weightedAverageAlleleLength(AC,ALT) = Sum(AC[i] * length(ALT[i]))/ Sum(AC).
Where length(A) = is difference in number of bases between A allele and the reference and if this is zero the number of base mismatches between both sequences (thus for a SNP it would be 1, and for a simple insertion of 4 bases it would be 4).
This normalization intent to reduce the QD for longer INDEL variants.
B.)
Actually QD has a soft-limit value of 35. If the QD calculated with the formula above is larger than 35, which happens amongst the examples that you provide; then this value is discarded and we draw a random QD from a normal distribution with mean 30 and std.dev. 3. That in turn means that several runs on the same data could potentially return different QDs although all close enough to 30.
I'm afraid I can't answer the obvious question why 35/30/3 or why simply don't impose such a limit, perhaps another colleague of mine can add more to the discussion once is back from vacation.
Answers
Hello,
To follow up
Is this definition still holds true to calculate QD:
QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
What if there is no AD value?
For one of the samples I extracted all the "format" data and notice following:
2614,072 GT:AB:AD:GQ:PL:SB
2627,232 GT:AD:GQ:PL:SB
123,63,8277 GT:DP:GQ:MIN_DP:PL
63 GT:GQ:PL:SB
So of all the total entries in a GVCF files majority does not have an "AD" value. How calculate QD then.
Thanks
@sum732
Hi,
Can you please confirm that you get the same results when using version 3.3?
Thanks,
Sheila
Hello Sheila,
Thanks for looking into this and responding.
I have picked up this work done a few months ago. I don't think new version was used again, and ( if possible) newer version can be tested at a later stage.
Do you think this is a bug? Is there any other explanation.
To keep it simple, QD is based on AD Values only and it's calculated for each variant, across all samples and then normalized(the sum total is then used to divide the QUAL) is this correct.
What happens when there is no AD?
Best,
Older versions had some glitches with AD values so it could be a minor bug. But there may be some subtleties of the calculation that we need to clarify in the context of the reference confidence model.
Hello Geraldine/Sheila,
I had posted a comment yesterday and system said its "waiting approval". Request you to please release/approve/ that.
For our analysis QD value (which a normalized val) is crucial. Without getting a complete understanding of it (how its calculated) cannot used it and can't move forward. Kindly elaborate on "subtleties of the calculation that we need to clarify in the context of the reference confidence model". What goes into it and how can I regenerate the values the match the report from GATK.
Many Thanks.
Best,
@sum732
Hi,
I have asked two of our methods developers to clarify what exactly the calculation is, and I will get back to you when they respond.
As for when there is no AD, you are looking at GVCF files. That is an intermediate file. The final vcf will have AD values for all sites.
-Sheila
Hi Shelia,
Great. Will wait for their (->your) response.
It sound bit strange that an AD for a variant if missing in intermediate files (GCVF), but will be present in the final output (joint VCF). So what value(s) are read to place a AD value for a sample if its missing in the individual and intermediate GVCF. I have also found that in some case there are no QD values for a variant call in joint VCFs (see below).
Basically, I am retracing the steps to better understand the pipeline and for our analysis we need to identify the cause of difference and for that I need to be confident of all the values (calculations that goes into it). This is the reason I am finding these unexplained factors(? for a better word).
e.g.(missing QD): AC=2;AF=0.071;AN=28;DP=590;FS=0.000;MQ=60.00;MQ0=0;Samples=Sample1;VQSLOD=18.23;VariantType=SNP;culprit=MQ
e.g. (have):
AC=2;AF=0.071;AN=28;DP=611;FS=0.000;MQ=60.00;MQ0=0;QD=30.59;Samples=Sample1;VQSLOD=19.91;VariantType=SNP;culprit=MQ
Thanks for your understanding and patience.
Best,
Hi @sum732,
So as you noticed the QD is roughly speaking QUAL / Sum(Sum(AD)).
Notice however that samples with no AD are ignored and so are samples that have AD but all the alternative alleles counts are zero. In other words QD does not take in consideration samples that do not show evidence for a concrete alternative allele. If no sample has a compliant AD then effectively QUAL is probably 0 (or very close) and so QD (probably not even present).
However there are two important caveats here, two inaccuracy to the simple formula above.
A.)
First the Sum(Sum(AC)) is actually "normalized" using a weighted average length of the alternative alleles.
So a more accurate formula for QD is
QD = QUAL / (Sum(Sum(AD)) * alleleLengthNormalization(AC,ALT))
Where alleleLengthNormalization(AC,ALT) = Max(1.0,weightedAverageAltAlleleLength(AC,ALT) / 3.0)
Where weightedAverageAlleleLength(AC,ALT) = Sum(AC[i] * length(ALT[i]))/ Sum(AC).
Where length(A) = is difference in number of bases between A allele and the reference and if this is zero the number of base mismatches between both sequences (thus for a SNP it would be 1, and for a simple insertion of 4 bases it would be 4).
This normalization intent to reduce the QD for longer INDEL variants.
B.)
Actually QD has a soft-limit value of 35. If the QD calculated with the formula above is larger than 35, which happens amongst the examples that you provide; then this value is discarded and we draw a random QD from a normal distribution with mean 30 and std.dev. 3. That in turn means that several runs on the same data could potentially return different QDs although all close enough to 30.
I'm afraid I can't answer the obvious question why 35/30/3 or why simply don't impose such a limit, perhaps another colleague of mine can add more to the discussion once is back from vacation.
Looking at the examples GVCF that have around here there is no AD in hom-ref call blocks and there is AD when there is some significant evidence for a concrete alternative allele (i.e. anything except <NON-REF>). Do you have an example of a GVCF variant block with no AD?
This is a bit more surprising to me... looking at those two very similar Info entries it seems strange that one would have it an the other not... it would be great if you could provide us with the full line that does not contain the QD and the steps to reproduce it (command and inputs focused on the problematic interval). This could be a bug.
Hi Valentin.
Thanks for the detailed response. I will try and post the entire line to help.
I am curious what the answer is to the obvious 35/30/3 question. I also have sort of a custom situation where this formula might be of use. Is the colleague back who could answer this? Thank you.
Issue · Github
by Sheila
I'm not sure why the 35 cap; my guess is that this number was found empirically to work well for the purposes of variant filtering. The 30/3 part derives from 35 in that we want to introduce jitter into the value (to avoid having a huge pile of QDs of exactly 35) so that the annotation will behave well in VQSR, and again those specific numbers were probably found to work well to produce an adequate distribution of values.
@Geraldine_VdAuwera The link to VQSR seems to be probable. Thank you for the answer.