Overestimation of AF in Mutect2 (GATK)?

fengxiefengxie ChinaMember


I ran Mutect2 (GATK) followed by FilterMutectCalls with default parameters. Some of the passing variants have an AF much larger than alt_depth/total_depth (see attached image). I checked the read mapping qualities and most of them are good, so the reads are unlikely to have been filtered by Mutect2.

Why is the AF larger than the alt_depth/total_depth?

I also noticed that the three variants with overestimated AF show read orientation bias. Can Mutect2 filter variants by read orientation bias?


chrX 41000336 . A G . PASS DP=2025;ECNT=1;NLOD=241.68;N_ART_LOD=-6.740e-01;POP_AF=1.000e-03;P_GERMLINE=-2.387e+02;TLOD=5.88 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1094,7:0.030:563,1:531,6:33:200,195:60:33:false:false:.:.:100.00:51.51:0.010,0.010,6.358e-03:6.461e-04,1.733e-03,0.998 0/0:828,2:0.028:422,1:406,1:34:199,181:60:19:false:false
chrX 66766308 . A G . PASS DP=1691;ECNT=2;NLOD=90.81;N_ART_LOD=-1.323e+00;POP_AF=1.000e-03;P_GERMLINE=-8.780e+01;TLOD=6.03 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1292,8:0.029:578,0:714,8:32:291,291:60:38:false:false:.:.:100.00:51.51:0.010,0.00,6.154e-03:7.151e-04,2.907e-03,0.996 0/0:314,1:0.033:152,0:162,1:32:191,297:60:15:false:false
chrX 66766320 . C T . PASS DP=1382;ECNT=2;NLOD=74.68;N_ART_LOD=-2.139e+00;POP_AF=1.000e-03;P_GERMLINE=-7.140e+01;TLOD=7.39 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1060,10:0.030:453,0:607,10:30:292,329:60:32:false:false:.:.:100.00:52.34:0.010,0.010,9.346e-03:0.042,4.118e-04,0.957 0/0:256,1:0.029:124,0:132,1:20:186,135:60:32:false:false
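To make the discrepancy concrete, the ratio alt/(ref+alt) can be computed from the AD field and set next to the reported AF. This is only a minimal sketch: the values are copied by hand from the first sample column (genotype 0/1, assumed here to be the tumor), and `ad_ratio` is a hypothetical helper, not part of GATK. The last SA_MAP_AF value is printed as well because it happens to sit close to the AD ratio in all three records; what that field means is not established in this thread.

```python
# Values copied from the first sample column of the three records above:
# (site, AD, AF, SA_MAP_AF)
records = [
    ("chrX:41000336 A>G", "1094,7", "0.030", "0.010,0.010,6.358e-03"),
    ("chrX:66766308 A>G", "1292,8", "0.029", "0.010,0.00,6.154e-03"),
    ("chrX:66766320 C>T", "1060,10", "0.030", "0.010,0.010,9.346e-03"),
]


def ad_ratio(ad_field):
    """Hypothetical helper: alt / (ref + alt) from a biallelic AD string."""
    ref_count, alt_count = (int(x) for x in ad_field.split(","))
    return alt_count / (ref_count + alt_count)


rows = []
for site, ad, af, sa_map_af in records:
    naive = ad_ratio(ad)                        # ratio from informative reads only
    reported = float(af)                        # AF as written by Mutect2
    last_sa = float(sa_map_af.split(",")[-1])   # third SA_MAP_AF value
    rows.append((site, naive, reported, last_sa))
    print(f"{site}: AD ratio={naive:.4f}  AF={reported:.3f}  last SA_MAP_AF={last_sa:.4f}")
```

For all three sites the reported AF is more than three times the AD ratio, which is the gap being asked about.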


Best Answers


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @fengxie,

    Please see this article for an explanation.

  • fengxiefengxie ChinaMember

    Hi shlee,


    Based on the article, some uninformative reads are counted towards DP but not towards AD. How is AF calculated? Does Mutect2 count uninformative reads when calculating AF?
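    The DP versus AD gap can be checked against the records in the question. This is a rough sketch that assumes the INFO DP value covers both sample columns; `ad_total` is a hypothetical helper and the numbers are copied by hand from the VCF lines.

```python
# (site, INFO DP, first-sample AD, second-sample AD), copied from the records in the question
records = [
    ("chrX:41000336", 2025, "1094,7", "828,2"),
    ("chrX:66766308", 1691, "1292,8", "314,1"),
    ("chrX:66766320", 1382, "1060,10", "256,1"),
]


def ad_total(ad_field):
    """Hypothetical helper: sum of the per-allele counts in an AD string."""
    return sum(int(x) for x in ad_field.split(","))


gaps = {}
for site, dp, ad_first, ad_second in records:
    counted = ad_total(ad_first) + ad_total(ad_second)  # reads assigned to an allele
    gaps[site] = dp - counted                           # reads in DP but not in any AD
    print(f"{site}: DP={dp}  sum(AD)={counted}  difference={gaps[site]}")
```

    Under that assumption, each site has a few dozen reads that appear in DP but in no AD count, so DP and the sum of AD are not interchangeable as denominators.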

    I checked the AD of chrX:66,766,320 C>T in IGV (see attached image). 1131 reads cover this site and 12 of them support the alternate allele. Even without excluding uninformative reads, the AF should be 12/1131 ≈ 0.01, but the AF calculated by Mutect2 is 0.03. Could you please explain further?


  • fengxiefengxie ChinaMember

    Thanks shlee. This is helpful!

  • YenanYenan Member

    Hi @shlee,

    I have read through all the threads you mentioned above, and I have several questions:

    First, can I conclude that (1) with GATK4 Mutect2, the output values of AD and AF are exactly correct; and (2) we cannot use the output value of DP to approximate the AF, because DP is inaccurate?

    Second, regarding FilterMutectCalls: judging from our data, I guess that the "clustered_events" filter is based on ECNT (the number of events in this haplotype). If ECNT is 3 or more, the variant is marked "clustered_events". So my question is: what is the reasoning behind the default cut-off of 2 for ECNT? Does it mean that one haplotype should not contain 3 or more variants? Could we make the filter less stringent by increasing the value of "--max-events-in-region"?
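    The rule guessed at above can be written as a small check. This is a sketch of that guess only, not the FilterMutectCalls implementation: `is_clustered` and `parse_info` are hypothetical helpers, and the threshold of 2 is the default cut-off as described in this post.

```python
def parse_info(info):
    """Hypothetical helper: split a VCF INFO string into a key -> value dict."""
    return dict(item.split("=", 1) for item in info.split(";") if "=" in item)


def is_clustered(ecnt, max_events_in_region=2):
    """Assumed rule: flag a variant whose haplotype has more events than allowed."""
    return ecnt > max_events_in_region


# Leading INFO keys of the three records from the original question
infos = [
    "DP=2025;ECNT=1;NLOD=241.68",
    "DP=1691;ECNT=2;NLOD=90.81",
    "DP=1382;ECNT=2;NLOD=74.68",
]

flags = [is_clustered(int(parse_info(info)["ECNT"])) for info in infos]
print(flags)
```

    With ECNT of 1, 2 and 2, none of the three records would be flagged under this rule, which is consistent with their PASS status. Raising the threshold to 3 would also let a variant with ECNT=3 through.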

    Thank you for your comments in advance! :)

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @manba are you referring to these comments from @Yenan?

    First, can I conclude that (1) with GATK4 Mutect2, the output values of AD and AF are exactly correct; and (2) we cannot use the output value of DP to approximate the AF, because DP is inaccurate?

    I'm not sure what is meant here, actually. I mean, AD is correct in the sense that we define it to be the number of informative reads in support of each allele and that we don't know of any bugs in the implementation. Similarly, DP is correct, although people can argue about whether a different definition is preferable. AF is an unknown latent variable but we do our best with the somatic likelihoods model.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    That thread explains how AD is defined by GATK tools, which might not agree with the most naive definition. We think it's a good definition.
