We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

#### Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.

# QD in Join VCF vs its components in GVCFs does not match

New YorkMember

Hello Forum Members and GATK Team,
I am trying to understand the calculations that goes into QD in GATK.

In the following its mentioned that the new versions calculate QD using AD:

In joint VCF file I have following entry:
1 5474238 . C A 163.45 PASS AC=2;AF=0.071;AN=28;DP=609;FS=0.000;MQ=60.00;MQ0=0;QD=27.05;Samples=Sample1;VQSLOD=18.21;VariantType=SNP;culprit=MQ GT:AD:GQ:PL 1/1:0,3:12:175,12,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,00/0:0,0:0:0,0,0

In the individual GVCF files (all 14) I search for Chromosome 1 and Position 5474238, and there only one entry:
1 5474238 . C A, 142.28 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:12:175,12,0,175,12,175:0,0,2,1

Here the QUAL is 142.28 and AD is 3. How come in the joint VCF file, the QUAL becomes 163.45 and QD=27.05 with DP=609?

Above is simple example where only in one sample a GT was called that was different from other 0/0.

Lets take another example (bit complicated):
Joint VCF file entry:
1 24089531 . A G 882.89 PASS AC=13;AF=0.464;AN=28;BaseQRankSum=0.358;DP=159;FS=4.751;MQ=60.00;MQ0=0;MQRankSum=0.358;QD=10.77;ReadPosRankSum=0.00;Samples=Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7,Sample8,Sample9,Sample10;VQSLOD=13.36;VariantType=SNP;culprit=DP GT:AB:AD:GQ:PL 1/1:.:0,3:7:48,7,0 0/1:0.600:6,4:50:50,0,117 0/1:0.670:4,2:31:31,0,87 0/1:0.500:3,3:40:62,0,40 0/1:0.200:1,4:12:68,0,12 0/0:.:6,0:18:0,18,155 0/1:0.790:11,3:38:38,0,230 1/1:.:0,15:44:437,44,0 0/0:.:4,1:8:0,8,91 0/0:.:0,0:0:0,0,0 0/0:.:0,0:0:0,0,0 0/1:0.670:4,2:35:35,0,91 1/1:.:0,1:3:11,3,0 0/1:.:3,2:38:38,0,49

Entires in the GVCF files:
Sample1
1 24089531 . A G, 16.33 . DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:7:48,7,0,48,7,48:0,0,2,1
Sample2
Sample3
Sample4
1 24089531 . A G, 0 . DP=7;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 0/0:6,0,0:18:0,18,155,18,155,155:6,0,0,0
Sample5
Sample6

Sample7

Sample8

Sample9

Sample14

Sample13

Sample10

Sample11
1 24089531 . A G, 0.05 . DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,1,0:3:11,3,0,11,3,11:0,0,0,0

Sample12

Using AD only, I am still not able to the QD value or the QUAL values in the Joint VCF files.

GATK Version=3.2-2-gec30cee

Tagged:

• New YorkMember

Hello,
Is this definition still holds true to calculate QD:
QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.

What if there is no AD value?

For one of the samples I extracted all the "format" data and notice following:
123,63,8277 GT:DP:GQ:MIN_DP:PL
63 GT:GQ:PL:SB
So of all the total entries in a GVCF files majority does not have an "AD" value. How calculate QD then.

Thanks

@sum732
Hi,

Can you please confirm that you get the same results when using version 3.3?

Thanks,
Sheila

• New YorkMember

Hello Sheila,
Thanks for looking into this and responding.
I have picked up this work done a few months ago. I don't think new version was used again, and ( if possible) newer version can be tested at a later stage.

Do you think this is a bug? Is there any other explanation.
To keep it simple, QD is based on AD Values only and it's calculated for each variant, across all samples and then normalized(the sum total is then used to divide the QUAL) is this correct.
What happens when there is no AD?

Best,

Older versions had some glitches with AD values so it could be a minor bug. But there may be some subtleties of the calculation that we need to clarify in the context of the reference confidence model.

• New YorkMember

Hello Geraldine/Sheila,
I had posted a comment yesterday and system said its "waiting approval". Request you to please release/approve/ that.

For our analysis QD value (which a normalized val) is crucial. Without getting a complete understanding of it (how its calculated) cannot used it and can't move forward. Kindly elaborate on "subtleties of the calculation that we need to clarify in the context of the reference confidence model". What goes into it and how can I regenerate the values the match the report from GATK.
Many Thanks.
Best,

@sum732

Hi,

I have asked two of our methods developers to clarify what exactly the calculation is, and I will get back to you when they respond.

As for when there is no AD, you are looking at GVCF files. That is an intermediate file. The final vcf will have AD values for all sites.

-Sheila

• New YorkMember

Hi Shelia,

Great. Will wait for their (->your) response.

It sound bit strange that an AD for a variant if missing in intermediate files (GCVF), but will be present in the final output (joint VCF). So what value(s) are read to place a AD value for a sample if its missing in the individual and intermediate GVCF. I have also found that in some case there are no QD values for a variant call in joint VCFs (see below).

Basically, I am retracing the steps to better understand the pipeline and for our analysis we need to identify the cause of difference and for that I need to be confident of all the values (calculations that goes into it). This is the reason I am finding these unexplained factors(? for a better word).

e.g.(missing QD): AC=2;AF=0.071;AN=28;DP=590;FS=0.000;MQ=60.00;MQ0=0;Samples=Sample1;VQSLOD=18.23;VariantType=SNP;culprit=MQ
e.g. (have):
AC=2;AF=0.071;AN=28;DP=611;FS=0.000;MQ=60.00;MQ0=0;QD=30.59;Samples=Sample1;VQSLOD=19.91;VariantType=SNP;culprit=MQ

Thanks for your understanding and patience.

Best,

• Cambridge, MAMember, Dev ✭✭

@sum732 said:
Hi Shelia,

Great. Will wait for their (->your) response.

It sound bit strange that an AD for a variant if missing in intermediate files (GCVF), but will be present in the final output (joint VCF). So what value(s) are read to place a AD value for a sample if its missing in the individual and intermediate GVCF.

Looking at the examples GVCF that have around here there is no AD in hom-ref call blocks and there is AD when there is some significant evidence for a concrete alternative allele (i.e. anything except <NON-REF>). Do you have an example of a GVCF variant block with no AD?

I have also found that in some case there are no QD values for a variant call in joint VCFs (see below).

This is a bit more surprising to me... looking at those two very similar Info entries it seems strange that one would have it an the other not... it would be great if you could provide us with the full line that does not contain the QD and the steps to reproduce it (command and inputs focused on the problematic interval). This could be a bug.

Basically, I am retracing the steps to better understand the pipeline and for our analysis we need to identify the cause of difference and for that I need to be confident of all the values (calculations that goes into it). This is the reason I am finding these unexplained factors(? for a better word).

e.g.(missing QD): AC=2;AF=0.071;AN=28;DP=590;FS=0.000;MQ=60.00;MQ0=0;Samples=Sample1;VQSLOD=18.23;VariantType=SNP;culprit=MQ
e.g. (have):
AC=2;AF=0.071;AN=28;DP=611;FS=0.000;MQ=60.00;MQ0=0;QD=30.59;Samples=Sample1;VQSLOD=19.91;VariantType=SNP;culprit=MQ

Thanks for your understanding and patience.

Best,

• New YorkMember

Hi Valentin.
Thanks for the detailed response. I will try and post the entire line to help.

• Member ✭✭
edited February 2016

Actually QD has a soft-limit value of 35. If the QD calculated with the formula above is larger than 35, which happens amongst the examples that you provide; then this value is discarded and we draw a random QD from a normal distribution with mean 30 and std.dev. 3. That in turn means that several runs on the same data could potentially return different QDs although all close enough to 30.

I'm afraid I can't answer the obvious question why 35/30/3 or why simply don't impose such a limit, perhaps another colleague of mine can add more to the discussion once is back from vacation.

I am curious what the answer is to the obvious 35/30/3 question. I also have sort of a custom situation where this formula might be of use. Is the colleague back who could answer this? Thank you.

Issue Number
549
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera