Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

QD in Join VCF vs its components in GVCFs does not match

sum732sum732 New YorkMember

Hello Forum Members and GATK Team,
I am trying to understand the calculations that goes into QD in GATK.

In the following its mentioned that the new versions calculate QD using AD:
https://www.broadinstitute.org/gatk/blog?id=3862

In joint VCF file I have following entry:
1 5474238 . C A 163.45 PASS AC=2;AF=0.071;AN=28;DP=609;FS=0.000;MQ=60.00;MQ0=0;QD=27.05;Samples=Sample1;VQSLOD=18.21;VariantType=SNP;culprit=MQ GT:AD:GQ:PL 1/1:0,3:12:175,12,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,00/0:0,0:0:0,0,0

In the individual GVCF files (all 14) I search for Chromosome 1 and Position 5474238, and there only one entry:
1 5474238 . C A, 142.28 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:12:175,12,0,175,12,175:0,0,2,1

Here the QUAL is 142.28 and AD is 3. How come in the joint VCF file, the QUAL becomes 163.45 and QD=27.05 with DP=609?

Above is simple example where only in one sample a GT was called that was different from other 0/0.

Lets take another example (bit complicated):
Joint VCF file entry:
1 24089531 . A G 882.89 PASS AC=13;AF=0.464;AN=28;BaseQRankSum=0.358;DP=159;FS=4.751;MQ=60.00;MQ0=0;MQRankSum=0.358;QD=10.77;ReadPosRankSum=0.00;Samples=Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7,Sample8,Sample9,Sample10;VQSLOD=13.36;VariantType=SNP;culprit=DP GT:AB:AD:GQ:PL 1/1:.:0,3:7:48,7,0 0/1:0.600:6,4:50:50,0,117 0/1:0.670:4,2:31:31,0,87 0/1:0.500:3,3:40:62,0,40 0/1:0.200:1,4:12:68,0,12 0/0:.:6,0:18:0,18,155 0/1:0.790:11,3:38:38,0,230 1/1:.:0,15:44:437,44,0 0/0:.:4,1:8:0,8,91 0/0:.:0,0:0:0,0,0 0/0:.:0,0:0:0,0,0 0/1:0.670:4,2:35:35,0,91 1/1:.:0,1:3:11,3,0 0/1:.:3,2:38:38,0,49

Entires in the GVCF files:
Sample1
1 24089531 . A G, 16.33 . DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:7:48,7,0,48,7,48:0,0,2,1
Sample2
1 24089531 . A G, 21.80 . BaseQRankSum=0.365;DP=12;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.741;ReadPosRankSum=-0.365 GT:AB:AD:GQ:PL:SB 0/1:0.600:6,4,0:50:50,0,117,68,128,196:6,0,3,1
Sample3
1 24089531 . A G, 38.03 . BaseQRankSum=-0.731;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AB:AD:GQ:PL:SB 0/1:0.200:1,4,0:12:68,0,12,71,24,95:1,0,3,1
Sample4
1 24089531 . A G, 0 . DP=7;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 0/0:6,0,0:18:0,18,155,18,155,155:6,0,0,0
Sample5
1 24089531 . A G, 4.61 . BaseQRankSum=-0.720;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.720;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:31:31,0,87,43,93,136:4,0,2,0
Sample6
1 24089531 . A G, 32.77 . BaseQRankSum=0.406;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.988;ReadPosRankSum=-0.406 GT:AB:AD:GQ:PL:SB 0/1:0.500:3,3,0:40:62,0,40,70,49,120:2,1,3,0

Sample7
1 24089531 . A G, 10.20 . BaseQRankSum=-0.156;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.778;ReadPosRankSum=-0.467 GT:AB:AD:GQ:PL:SB 0/1:0.790:11,3,0:38:38,0,230,70,239,309:11,0,3,0

Sample8
1 24089531 . A G, 10.20 . BaseQRankSum=-0.156;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.778;ReadPosRankSum=-0.467 GT:AB:AD:GQ:PL:SB 0/1:0.790:11,3,0:38:38,0,230,70,239,309:11,0,3,0

Sample9
1 24089531 . A G, 0 . BaseQRankSum=-0.731;DP=12;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AD:GQ:PL:SB 0/0:4,1,0:8:0,8,91,12,93,97:4,0,0,0

Sample14

Sample13

Sample10
1 24089531 . A G, 7.60 . BaseQRankSum=-1.380;DP=9;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.380;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:35:35,0,91,47,97,144:4,0,2,0

Sample11
1 24089531 . A G, 0.05 . DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,1,0:3:11,3,0,11,3,11:0,0,0,0

Sample12
1 24089531 . A G, 9.31 . BaseQRankSum=0.358;DP=7;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.358;ReadPosRankSum=-0.358 GT:AD:GQ:PL:SB 0/1:3,2,0:38:38,0,49,47,55,102:0,0,0,0

Using AD only, I am still not able to the QD value or the QUAL values in the Joint VCF files.

GATK Version=3.2-2-gec30cee

Thanks in Advance.

Tagged:

Best Answer

Answers

  • sum732sum732 New YorkMember

    Hello,
    To follow up
    Is this definition still holds true to calculate QD:
    QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.

    What if there is no AD value?

    For one of the samples I extracted all the "format" data and notice following:
    2614,072 GT:AB:AD:GQ:PL:SB
    2627,232 GT:AD:GQ:PL:SB
    123,63,8277 GT:DP:GQ:MIN_DP:PL
    63 GT:GQ:PL:SB
    So of all the total entries in a GVCF files majority does not have an "AD" value. How calculate QD then.

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie admin

    @sum732
    Hi,

    Can you please confirm that you get the same results when using version 3.3?

    Thanks,
    Sheila

  • sum732sum732 New YorkMember

    Hello Sheila,
    Thanks for looking into this and responding.
    I have picked up this work done a few months ago. I don't think new version was used again, and ( if possible) newer version can be tested at a later stage.

    Do you think this is a bug? Is there any other explanation.
    To keep it simple, QD is based on AD Values only and it's calculated for each variant, across all samples and then normalized(the sum total is then used to divide the QUAL) is this correct.
    What happens when there is no AD?

    Best,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Older versions had some glitches with AD values so it could be a minor bug. But there may be some subtleties of the calculation that we need to clarify in the context of the reference confidence model.

  • sum732sum732 New YorkMember

    Hello Geraldine/Sheila,
    I had posted a comment yesterday and system said its "waiting approval". Request you to please release/approve/ that.

    For our analysis QD value (which a normalized val) is crucial. Without getting a complete understanding of it (how its calculated) cannot used it and can't move forward. Kindly elaborate on "subtleties of the calculation that we need to clarify in the context of the reference confidence model". What goes into it and how can I regenerate the values the match the report from GATK.
    Many Thanks.
    Best,

  • SheilaSheila Broad InstituteMember, Broadie admin

    @sum732

    Hi,

    I have asked two of our methods developers to clarify what exactly the calculation is, and I will get back to you when they respond.

    As for when there is no AD, you are looking at GVCF files. That is an intermediate file. The final vcf will have AD values for all sites.

    -Sheila

  • sum732sum732 New YorkMember

    Hi Shelia,

    Great. Will wait for their (->your) response.

    It sound bit strange that an AD for a variant if missing in intermediate files (GCVF), but will be present in the final output (joint VCF). So what value(s) are read to place a AD value for a sample if its missing in the individual and intermediate GVCF. I have also found that in some case there are no QD values for a variant call in joint VCFs (see below).

    Basically, I am retracing the steps to better understand the pipeline and for our analysis we need to identify the cause of difference and for that I need to be confident of all the values (calculations that goes into it). This is the reason I am finding these unexplained factors(? for a better word).

    e.g.(missing QD): AC=2;AF=0.071;AN=28;DP=590;FS=0.000;MQ=60.00;MQ0=0;Samples=Sample1;VQSLOD=18.23;VariantType=SNP;culprit=MQ
    e.g. (have):
    AC=2;AF=0.071;AN=28;DP=611;FS=0.000;MQ=60.00;MQ0=0;QD=30.59;Samples=Sample1;VQSLOD=19.91;VariantType=SNP;culprit=MQ

    Thanks for your understanding and patience.

    Best,

  • valentinvalentin Cambridge, MAMember, Dev ✭✭

    @sum732 said:
    Hi Shelia,

    Great. Will wait for their (->your) response.

    It sound bit strange that an AD for a variant if missing in intermediate files (GCVF), but will be present in the final output (joint VCF). So what value(s) are read to place a AD value for a sample if its missing in the individual and intermediate GVCF.

    Looking at the examples GVCF that have around here there is no AD in hom-ref call blocks and there is AD when there is some significant evidence for a concrete alternative allele (i.e. anything except <NON-REF>). Do you have an example of a GVCF variant block with no AD?

    I have also found that in some case there are no QD values for a variant call in joint VCFs (see below).

    This is a bit more surprising to me... looking at those two very similar Info entries it seems strange that one would have it an the other not... it would be great if you could provide us with the full line that does not contain the QD and the steps to reproduce it (command and inputs focused on the problematic interval). This could be a bug.

    Basically, I am retracing the steps to better understand the pipeline and for our analysis we need to identify the cause of difference and for that I need to be confident of all the values (calculations that goes into it). This is the reason I am finding these unexplained factors(? for a better word).

    e.g.(missing QD): AC=2;AF=0.071;AN=28;DP=590;FS=0.000;MQ=60.00;MQ0=0;Samples=Sample1;VQSLOD=18.23;VariantType=SNP;culprit=MQ
    e.g. (have):
    AC=2;AF=0.071;AN=28;DP=611;FS=0.000;MQ=60.00;MQ0=0;QD=30.59;Samples=Sample1;VQSLOD=19.91;VariantType=SNP;culprit=MQ

    Thanks for your understanding and patience.

    Best,

  • sum732sum732 New YorkMember

    Hi Valentin.
    Thanks for the detailed response. I will try and post the entire line to help.

  • WimSWimS Member ✭✭
    edited February 2016

    Actually QD has a soft-limit value of 35. If the QD calculated with the formula above is larger than 35, which happens amongst the examples that you provide; then this value is discarded and we draw a random QD from a normal distribution with mean 30 and std.dev. 3. That in turn means that several runs on the same data could potentially return different QDs although all close enough to 30.

    I'm afraid I can't answer the obvious question why 35/30/3 or why simply don't impose such a limit, perhaps another colleague of mine can add more to the discussion once is back from vacation.

    I am curious what the answer is to the obvious 35/30/3 question. I also have sort of a custom situation where this formula might be of use. Is the colleague back who could answer this? Thank you.

    Issue · Github
    by Sheila

    Issue Number
    549
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I'm not sure why the 35 cap; my guess is that this number was found empirically to work well for the purposes of variant filtering. The 30/3 part derives from 35 in that we want to introduce jitter into the value (to avoid having a huge pile of QDs of exactly 35) so that the annotation will behave well in VQSR, and again those specific numbers were probably found to work well to produce an adequate distribution of values.

  • WimSWimS Member ✭✭

    @Geraldine_VdAuwera The link to VQSR seems to be probable. Thank you for the answer.

Sign In or Register to comment.