Genotypes in vcf with a GQ=0

Hello,

I'm using GATK version 3.7 and I have a problem similar to the point raised in 2015 and titled "Unexpected genotype likelihoods & genotype qualities".
I've used HaplotypeCaller to generate individual GVCFs and after that, have used GenotypeGVCFs on multiple individual samples.
In the resulting vcf, I have some genotypes 0/0 with a GQ=0, whereas I was waiting for genotypes ./. (I suppose that when 2 genotypes have the same probability, the genotype should be unknown).

Could you please tell me how to do to force GATK to put ./. when GQ=0 ?
I've tried the GenotypeGVCFs command with teh version 3.8 and there's the same problem.

Thank very much.
Fabienne

Answers

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi Fabienne,

    You could "SelectVariants" with --set-filtered-gt-to-nocall for this purpose.
    Also, here is a document explaining the feature: https://software.broadinstitute.org/gatk/documentation/article?id=12350

    Please let me know if this helps. :)

    Regards
    Bhanu

  • vshastryvshastry Laramie, WyomingMember
    edited December 2018

    Hello Bhanu,

    For the same locus, is it ok to use the PL field in downstream analysis as accurate data? In other words, is the PL field reliable since it contains likelihood information?
    Sometimes the PL field is shown as "0,0,290" for that site so I wasn't sure if that is relevant information to be used.

    Thanks,
    Vivaswat

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @vshastry I am following up for Bhanu and will get back to you with an answer about the reliability of the PL field!

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi there @vshastry, I have been looking into your question and I think this discussion on how PL is calculated might be helpful. Because the likelihoods are normalized to have the most likely allele value at 0 and the higher numbers representing the lowest likelihood, the values in the vcf are not true Phred values that can be used downstream. However, it can give you an idea of which alleles are very rare.

    In the case of the reference allele of A with an output of 0,0,290, as you noted, the most likely genotypes are A/A and A/T, with T/T being very rare in relation to the two very likely genotypes. However, these are not absolute PL values, so extending the comparison to another site would not be possible, so a meaningful downstream analysis using these numbers would not be possible either.

    Does this make sense? I encourage you to take a look at the page about how these values are calculated.

  • vshastryvshastry Laramie, WyomingMember

    Hey @AdelaideR , thanks for getting back to me so quickly!
    Yes, that makes sense. But the problem with these type of sites is that the GQ field is 0. So now the Raw PL is the same as the Normalized PL.

    Here is a snippet of a vcf file generated with GenotypeGVCF on GATK v3.7:
    scaffold_2 2706 . T C 20759.39 PASS AC=28;AF=0.197;AN=142;BaseQRankSum=0.189;ClippingRankSum=0.00;DP=4214;ExcessHet=42.6387;FS=6.454;InbreedingCoeff=-0.4303;MQ=60.75;MQRankSum=-3.445e+00;QD=10.61;ReadPosRankSum=-1.196e+00;SOR=1.251;set=snps GT:AD:DP:FT:GQ:PGT:PID:PL 0/0:219,0:219:PASS:0:.:.:0,0,3114 0/0:120,0:120:PASS:0:.:.:0,0,677 0/0:152,0:152:PASS:0:.:.:0,0,2385 0/0:178,0:178:PASS:0:.:.:0,0,3376 0/1:59,33:92:PASS:99:0|1:2640_A_G:1162,0,9397 ./.:0,0:.:PASS 0/0:106,0:106:PASS:0:.:.:0,0,2077 0/1:68,69:137:PASS:99:.:.:2470,0,2648 0/0:130,0:130:PASS:0:.:.:0,0,1434 0/0:389,0:389:PASS:0:.:.:0,0,6931 0/0:196,0:196:PASS:0:.:.:0,0,4256 0/0:197,0:197:PASS:0:.:.:0,0,1323 0/0:168,0:168:PASS:0:.:.:0,0,3749 0/0:163,0:163:PASS:0:.:.:0,0,1507 ./.:0,0:.:PASS 0/0:156,0:156:PASS:0:.:.:0,0,1257 0/0:115,0:115:PASS:0:.:.:0,0,2579 0/0:103,0:103:PASS:0:.:.:0,0,2068 0/0:58,0:58:PASS:0:.:.:0,0,698 0/1:50,15:65:PASS:99:.:.:369,0,1797 0/0:89,0:89:PASS:0:.:.:0,0,1310 ./.:0,0:.:PASS 0/0:94,0:94:PASS:0:.:.:0,0,1854 0/0:80,0:80:PASS:0:.:.:0,0,1819 0/0:79,0:79:PASS:0:.:.:0,0,969 0/0:52,0:52:PASS:0:.:.:0,0,1040 0/0:79,0:79:PASS:0:.:.:0,0,1288 0/0:37,0:37:PASS:0:.:.:0,0,29 0/0/0/1:15,3:18:PASS:20:.:.:48,0,20,63,663 0/0/0/0:50,0:50:PASS:0:.:.:0,0,0,0,575 0/0/0/1:16,2:18:PASS:22:.:.:58,0,22,67,688 0/0/0/0:70,0:70:PASS:0:.:.:0,0,0,0,1178 0/0/0/0:42,0:42:PASS:54:.:.:0,54,129,259,2363 0/0/0/0:109,0:109:PASS:0:.:.:0,0,0,0,589 0/0/0/1:50,11:61:PASS:58:.:.:264,0,58,193,2036 0/0/0/0:161,0:161:PASS:0:.:.:0,0,0,0,1782 0/1/1/1:3,20:23:PASS:26:.:.:820,81,26,0,149 0/0/0/0:81,0:81:PASS:0:.:.:0,0,0,0,2272 0/1/1/1:11,26:37:PASS:13:.:.:1015,72,13,0,965 0/0/0/0:67,0:67:PASS:0:.:.:0,0,0,0,3 0/0/0/0:54,0:54:PASS:0:.:.:0,0,0,0,225 0/0/0/0:15,0:15:PASS:0:.:.:0,0,0,0,129 0/0/1/1:5,5:10:PASS:6:.:.:187,6,0,6,174 0/0/1/1:3,5:8:PASS:0:.:.:150,10,0,0,102 ./.:4,0:4:DP:5:.:.:0,5,12,24,171 ./.:2,4:6:DP:1:.:.:159,10,1,0,71 0/0/0/0:20,0:20:PASS:0:.:.:0,0,0,0,329 1/1/1/1:0,8:8:PASS:10:.:.:349,48,24,10,0 0/0/0/0:33,0:33:PASS:0:.:.:0,0,0,0,490 1/1/1/1:0,11:11:PASS:14:.:.:438,65,33,14,0 0/0/0/0:23,0:23:PASS:0:.:.:0,0,0,0,302 ./.:0,2:2:DP:2:.:.:81,12,6,2,0 1/1/1/1:0,8:8:PASS:10:.:.:360,48,24,10,0 0/0/0/0:21,0:21:PASS:29:.:.:0,29,69,137,999

    I've highlighted some individuals that exhibit this behavior. From the AD field it is clear that these individuals have a hom-ref but this isn't reflected in the PL field (which shows that multiple dosages are equally likely even though the reads point to a hom-ref). The genotype (GT) that is called seems to be right but if I'm using the PL field, I discard the GT.
    I believe that there is something wrong with the numbers being reported in the PL field but I'm not sure.

    I hope what I'm trying to say makes sense.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @vshastry

    I know this can be confusing, but here is an explanation that might help you.
    The value of GQ is simply the difference between the second lowest PL and the lowest PL (which is always 0).
    So because the PL of 0/0 and 0/1 is the same, GQ is 0.
    Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another.

    Read this document thoroughly, it will help understand these concepts better: https://software.broadinstitute.org/gatk/documentation/article?id=11075

  • vshastryvshastry Laramie, WyomingMember
    edited December 2018

    Hi all,
    Thank you for your responses.
    I understand the calculations for the PL field but I was surprised at how these numbers could arise given the read data. For example, if we have AD field=10,0 i.e. 10 reads of reference A and 0 reads of alternate T:
    P(AA | data) = 0.90
    P(AT | data) = 0.09
    P(TT | data) = 0.01

    (Approximate values)
    Raw PL: -10*log(0.90) = 0       -10*log(0.09) = 10       -10*log(0.01) = 20
    Normalized PL = Raw PL here since the lowest PL is already 0.

    This is similar to the fields highlighted in the sample vcf file above, yet the PL that is reported is not something like 0, 10, 20 but something like 0, 0, 29 (for example), which is clearly not the case.

    From another question in the forum (https://gatkforums.broadinstitute.org/gatk/discussion/8137/gq-score-weirdness-zero-vs-99) I learned that this only arises when using GenotypeGVCF since it gets these reads at the site but this site is then marked as inactive by HaplotypeCaller so the PL field can be messed up.

    In summary, it looks like I cannot use the PL field for any downstream analyses if it contains this type of data.

    But as a suggestion to the developers, I think these type of sites should be called as ./. (missing genotype) instead of calling an actual genotype on it. This could be misleading to folks who use the GT field in their analyses and incorporate these sites in their model.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @vshastry

    Thank you for the update and the detailed explanation. I will bring this up with the dev team.

  • vshastryvshastry Laramie, WyomingMember

    Hello @bhanuGandham,

    A follow-up on the previous posts. I found this thread from a couple of years ago in which the same problem ("printing ./. in the GT field when you have GQ=0") cropped up. The moderators said it had been fixed in a nightly build but I don't think it has.
    Hopefully this gives it more context.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @vshastry

    Sorry for the delay in getting back to you. Our team was on a winter break and we are currently playing catchup and hence has taken longer than we expected to get back to you. I apologize for that.
    We have not forgotten about this issue, I am meeting with the dev team on Monday to discuss this. I will get back to you by end of day Monday with a suggestion/solution.

    Thank you for your patience.

  • vshastryvshastry Laramie, WyomingMember

    Oh no worries at all, I totally understand.
    I stumbled onto that post and I thought it would be relevant. I did not mean to come across as being demanding, sorry.
    Hope you had a good winter break.

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Hi Vivaswat (@vshastry),

    I've seen this GQ0 issue with the [0,0,X] PLs before. In my experience, it happens where there's some alternate haplotype that wasn't well captured by the assembly, so when the reference confidence likelihoods are computed, there's actually a lot of support for the symbolic allele (where we aggregate all the evidence like mismatches, bases next to indels, and soft-clips). If a site has roughly >=10% of the reads supporting the it will get genotyped as a het for one copy of the reference allele and one copy of a allele, but we don't want to call a symbolic allele, so we cap the likelihoods to favor homRef. In that case your PLs were probably something like [500, 0, 1000], which becomes [0, 0, 1500]. The case where the variant that doesn't get assembled is homVar will yield PL=[0,0,0] in the end, which I believe we do convert to no-call. There was a philosophical discussion some time ago and we decided not to change [0,0,X] PLs to no-call. Unfortunately development on GATK3 is officially at an end, but this is an update we could potentially add to GATK4. It's also possible this issue is the same case that I've already fixed in GATK4: https://github.com/broadinstitute/gatk/pull/5172 Is using GATK4 an option for you?

    -Laura

Sign In or Register to comment.