Hi GATK Users,

Happy Thanksgiving!
Our staff will be observing the holiday and will be unavailable from 22nd to 25th November. This will cause a delay in reaching out to you and answering your questions immediately. Rest assured we will get back to it on Monday November 26th. We are grateful for your support and patience.
Have a great holiday everyone!!!

Regards
GATK Staff

AF calculation in Mutect2

Hi,

I've been calling somatic mutations with Mutect2, and have noted that the AF calculation does not always correlate to DP alt/(DP ref+DP alt). For example, consider this indel:

T057.mutect2.vcf:7 17379247 . CAACAGCAACAGTCCTTGGCTCTGAACTCAAGCTGTATGGTACAGGAACACCTACATCTAG C . PASS ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=57.68;TLOD=106.87 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:66,30:0.452:18:12:.:1983,919:32:34 0/0:214,0:0.112:0:0:.:6285,0:98:116

The tumour variant AF should be (30/(30+66)=0.31, but is reported as 0.42, while the normal variant AF should be 0/214=0, but is reported as 0.112

Am I missing something/misinterpreting these fields?

Any comments and help greatly appreciated!

Thanks,

Stephen

Tagged:

Issue · Github
by Sheila

Issue Number
665
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    One possibility is that there are reads supporting one or the other allele that aren't being counted in the AD values because they're not considered informative for the purposes of the genotyping, but they are being counted in the allele fraction estimation. Another is that there are reads supporting some other allele that is not called, but is present in some minor fraction, and is therefore counted in the AF estimation as well. In both cases you would see them if you look at the read data in the bam file (preferably the "bamout" that can be generated as documented for HaplotypeCaller).

    If you can't find any evidence to support this in the read data, let us know and we'll take a deeper look. In that case we would need a snippet of your data to debug locally. Instructions for submitting a snippet are here: https://www.broadinstitute.org/gatk/guide/article?id=1894

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @stephensammut
    Hi Stephen,

    This article should help with what uninformative reads are.

    -Sheila

  • mikitebamikiteba MeldolaMember

    Hi Stephen,
    why it is not reported the DP value in the columns INFO or FORMAT?
    ES:
    ECNT=49;HCNT=1;MAX_ED=285;MIN_ED=12;NLOD=0.00;TLOD=9.51 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:1919,11:5.699e-03:1:10:0.091:69980,343:986:933

    Where I found the depth of the site like Mutect 1?
    ES:
    GT:AD:BQ:DP:FA 0/1:991,5:38:997:5.020e-03

    Thanks
    Michela.

    Issue · Github
    by Sheila

    Issue Number
    1174
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    DP isn't reported, but you can use the AD values (depth per allele) to approximate it.

  • kmhernankmhernan Chicago, ILMember
    edited September 2016

    @Geraldine_VdAuwera I am having perhaps a related problem, but it is much more worrisome. I have several somatic mutations that pass with 0 counts for the alternative allele in the AD section. Here is an example, with the first few GT data:

    0/1:28,0:0.107:0:0
    

    How is this possible? You can see the allele frequency is 0.107, but the counts are 28 and 0.

    Thanks

    Edit: are soft clipped bases considered uninformative?

    Edit2: I think i answered my own question here and after re-reading the link @Sheila shared. My bad. Carry on.

  • JieChenJieChen shanghaiMember

    @Sheila said:
    @stephensammut
    Hi Stephen,

    This article should help with what uninformative reads are.

    -Sheila

    Hi,
    can you help me explain the 100% AF with the depth 1892,3,and there is a lot of such mutation in my vcf file
    chr6 117638470 . AC A . PASS ECNT=1;HCNT=2;MAX_ED=.;MIN_ED=.;TLOD=11.60 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:1892,3:1.00:2:1:.:68770,66:951:941

    thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @JieChen
    Hi,

    Are you using GATK4 or GATK3? I think there were some issues in GATK3 AF calculation. Also, have a look at this thread.

    -Sheila

  • JieChenJieChen shanghaiMember

    @Sheila said:
    @JieChen
    Hi,

    Are you using GATK4 or GATK3? I think there were some issues in GATK3 AF calculation. Also, have a look at this thread.

    -Sheila

    Thanks. I was using GATK3.7 .

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @JieChen
    Hi,

    Yes, I think GATK4 should have a better AF calculation. I don't think this will be fixed in GATK3, as all development is now focused on GATK4.

    -Sheila

  • JieChenJieChen shanghaiMember

    @Sheila said:
    @JieChen
    Hi,

    Yes, I think GATK4 should have a better AF calculation. I don't think this will be fixed in GATK3, as all development is now focused on GATK4.

    -Sheila

    Hi,
    I will check gatk4 later.
    Another question: how does mutect2 treat with the reads which have the same start position(which may regard as duplication).
    Thanks.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @JieChen,

    GATK3-MuTect2 documentation lists the read filters that the engine uses upfront. You see the DuplicateReadFilter listed and so any reads marked as duplicate (with the 0x400 SAM flag) are omitted from analyses.

    In GATK4-beta.3-Mutect2 (tool is in beta), we see the same NotDuplicateReadFilter listed. I believe the team is working on a filter duplicate_evidence that becomes relevant for data with unique molecular indices and that have been marked with UmiAwareMarkDuplicates. I'm hazy on the details of this so let me know if it is important to you and I can consult a developer.

  • JieChenJieChen shanghaiMember

    Hi,@shlee,

    I switched to GATK4 Mutect2 to deal with my umi data, the data was preprocessed for umi duplication, and there is no duplicate flag in the bam file, I was wondering whether mutect2 will recognize duplicate with the reads position or only the flag.
    And the DuplicateReadFilter default is disable, right?

    the results of GATK4 mutect2:
    chr7 55273033 . C A . . DP=2056;ECNT=10;NLOD=19.00;N_ART_LOD=6.15;POP_AF=1.000e-03;P_GERMLINE=-1.570e+01;TLOD=160.73 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:MBQ:MCL:MFRL:MMQ:MPOS:REF_F1R2:REF_F2R1:SA_MAP_AF:SA_POST_PROB 0/1:1744,78:0.075:78:0:0.00:35,13:0,0:158,154:60,60:16,11:816:928:0.040,0.00,0.043:2.152e-20,1.000,1.076e-16 0/0:105,3:0.037:3:0:0.00:34,14:0,0:189,220:60,60:13,18:55:50

    IGV shows only 19 reads support alt ( don't known why can't attach the image here) , but there is 78 reads detected in vcf.
    why the support reads reads number is larger in vcf than in the bam?

    Thanks.

    Issue · Github
    by shlee

    Issue Number
    2466
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    sooheelee
  • JieChenJieChen shanghaiMember

    @shlee
    one more question:how does mutect2 determine the genotye of the mutation?are there any thresholds about AF or others?
    there are other positions like below example,in IGV, each two alt almost has the same support reads number, but for some positions the calling result report only alt , and others would report two alt.

    chr6 117643021 . C A,T . . DP=492;ECNT=4;NLOD=14.16,6.03;N_ART_LOD=-1.460e+00,3.32;POP_AF=1.000e-03,1.000e-03;P_GERMLINE=-1.085e+01,-2.728e+00;TLOD=66.46,23.73 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:MBQ:MCL:MFRL:MMQ:MPOS:REF_F1R2:REF_F2R1:SA_MAP_AF:SA_POST_PROB 0/1/2:230,46,53:0.190,0.188:29:17:0.370:38,16,8:0,0,0:167,168,160:60,60,60:15,3,16:77:152:0.101,0.131,0.140:0.776,4.158e-04,0.223 0/0:52,3,6:0.075,0.130:3:0:0.00:37,17,13:0,0,0:212,151,232:60,60,60:7,28,18:25:27

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @JieChen,

    We just released 4.beta.4 three days ago so can we assume you are using 4.beta.3?

    The upfront filtering of duplicates is done by the engine and the engine uses the SAM flag to define duplicates. This is on by default (see list of filters under Read filters here). If you need to disable a filter, you can use the --disableReadFilter option. For example, I have been using --disableReadFilter MateOnSameContigOrNoMappedMateReadFilter recently.

    There is another parameter in Mutect2, --maxReadsPerAlignmentStart that is not a duplicate filter but effectively uses read position to limit the number of reads that have the same start. By default, the parameter limits the number of reads per alignment start to 50. You can adjust this number. Remember that Picard's duplicate marking tools mark duplicates at the insert level for paired end reads, which is a more sophisticated approach than using a single mate in a pair.

    It's my understanding the two types of filtering are different from umi-related duplicate_evidence. I'll ask a developer to chime in here on its details and status as this is brand new and I know nothing about it besides its application to UMI data.

    When viewing on IGV, are you viewing the BAMOUT produced by Mutect2 or the BAM that serves as the input of the Mutect2 command? Also, do your IGV settings allow viewing supplementary alignments, which our tools will include in analyses? Finally, be sure IGV is not downsampling reads in your locus, which appears rather deep (AD=1744,78). You will see black lines below the coverage chart if there is downsampling by IGV. All of these options can be set on or off in IGV's View>Preferences panel, under the Alignments tab.

    As for the multi-SNP site, I believe M2 calls only apparent somatic sites. GATK4-M2 is sensitive to the allele in the matched normal and the known germline variants resource. If the site is multi-allelic, M2 calls any ALT alleles in the tumor sample that are not in the normal sample and that are not REF. However, the tool does not report putative loss of heterozygosity (LOH) events, where there is an ALT (against the reference) in the normal that is no longer in the tumor. To detect LOH events, we have another workflow, the ACNV workflow. Let me double-check these details with our developer as well.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    @JieChen,

    For UMI-related artifact filtering, the read data must have gone through UmiAwareMarkDuplicatesWithMateCigar. Our developer says:

    The duplicate_evidence filter removes calls where the evidence for the variant only occurs in a single position in a read....This filter is not necessary for most workflows. In normal workflows it should be very rarely triggered.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    @JieChen, I have more concrete information for you.

    To enable filtering by the duplicate_evidence filter, you need to have asked Mutect2 to annotation a non-default annotation with:

    -A UniqueAltReadCount
    

    For details on this annotation, see https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/UniqueAltReadCount.java.

    Basically, this provides an annotation for variant calls that counts unique evidence. Unique evidence is determined by a read's start and stop, even for paired data. That is, the mate is not a part of the equation, which as you know is the basis for Picard duplicate marking. If a variant call then comes from exactly the same position in UMI sister reads (that align identically), then the annotation notes it.

    With the presence of this annotation, FilterMutectCalls can then filter calls based such duplicate evidence.

    Remember that the engine itself, for M2, upfront filters any reads with the 0x400 duplicate SAM flag. UMI-aware duplicate marking should not flag UMI sisters (same alignment, different UMI) and so these remain in the analysis set of M2. You can of course disable the upfront duplicate filtering but I don't see a reason to do so for hybrid capture data.


    Our developer answers your questions on genotyping.

    The genotype is just a (potentially non-diploid) concatenation of all alleles that pass the lod threshold i.e. 0/1, 0/1/2, 0/1/2/3 etc. The AF field and the lods are the interesting information. We do not filter alleles based on any heuristic threshold of AF, AD etc, just the lod from our probabilistic model.

    If IGV shows fewer reads than the vcf AD field, the simplest possibilities would be 1) IGV's filters (MQ etc) are stricter than Mutect's or 2) the alt alleles show up in reassembly. One can check the bamout for possibility 2).

    Also, our developer finds the following question unclear:

    there are other positions like below example,in IGV, each two alt almost has the same support reads number, but for some positions the calling result report only alt , and others would report two alt.
    

    Are you seeing more discrepancies between the apparent read counts from IGV and the AD field in Mutect?

    Mutect's AD field takes into account probabilities. For example, if two reads each have a 0.6 probability of coming from some allele according to Mutect's mathematical model, that yields a total AD of 1.2, which is rounded to 1 since AD must be an integer.

    I hope this helps to clarify things.

Sign In or Register to comment.