it's a bug for the calculation of Mutect2 allele frequency ?

HI,
when I used Mutect2 to call mutations .
I got this bug ,
chr3 12649448 . C CCAAA GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:139,66:0.143:0:1:.:138,31:0:5 .

66/(139+66) = 0.32, not 0.143.

so what's wrong with Mutect2 ??

Tagged:

Issue · Github
by shlee

Issue Number
2422
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee

Answers

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin
    edited August 2017

    Hi @YingLiu,

    You are showing the allele fraction is lower than expected from the allele depths:

    AD-->139,66
    AF-->0.143
    

    In HaplotypeCaller we can expect AD to be lower than expected (see Article#6005). Mutect2 and HaplotypeCaller share functionalities. Before we get into your question, can you tell us if you are using GATK3 or GATK4 Mutect2 and the specific release version? If you are using GATK3, then please switch to using GATK4 Mutect2. Also, can you post the M2 command that you use? We need to know if you using --useNewAFCalculator or the original QUAL model. Finally, I suggest you study (or share here by posting) the BAMOUT for the locus in IGV. I give instructions on how to do this in section 6 of the original Mutect2 hands-on tutorial at https://drive.google.com/open?id=0BwTg3aXzGxEDdXRsY1hWdzU5TzQ presented in February of this year in Belgium. Note that the tutorial uses GATK3 Mutect2.

  • YingLiuYingLiu ChinaMember

    @shlee,
    the gatk version : "The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18"
    the cmd is " java -jar GenomeAnalysisTK.jar -T MuTect2 -nct 20 -R ref.fa -I:tumor tumor.bam -I:normal normal.bam -L my.bed
    --dbsnp dbsnp_138.hg19.vcf --cosmic CosmicCodingMuts.vcf -o mutect2_ori.vcf "

  • dayzcooldayzcool Member

    @shlee,
    I would like to learn more about --useNewAFCalculator, too. Particularly, as a measurement for estimating fraction of clones, how do you compare default AF calculator, new AF calculator, and estimation from AD like @YingLiu did above?

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin
    edited August 2017

    Hi @YingLiu,

    I would strongly suggest you upgrade to GATK4 Mutect2. In GATK4, the functionalities are broken into two tools, Mutect2 and FilterMutectCalls. If you still want to understand what is going on in GATK3-MuTect2, then would you mind creating a snippet of the region from your BAM and uploading this to our FTP site? It's difficult to tell what is going on without IGV visualization. Instructions are in https://software.broadinstitute.org/gatk/guide/article?id=1894.

    @dayzcool, I know -newQual is recommended for the germline side (HaplotypeCaller and GenotypeGVCFs). I just saw that the argument is listed under parameters for Mutect2 and FilterMutectCalls and wanted to gather as much information as possible from @YingLiu. I'll have to consult our developer (currently on vacation) exactly how the somatic workflow uses this parameter. It may be that the somatic side does not use this option at all and our documentation needs to be more discerning.

  • YingLiuYingLiu ChinaMember

    @shlee
    I am generating bam for that site ,and will upload my result after finished .
    thank you very much !

  • YingLiuYingLiu ChinaMember

    @shlee , I have upload my result named "shlee_ying.tar.gz" at the shlee_ying directroy .
    please check .

  • YingLiuYingLiu ChinaMember

    @shlee said:
    Hi @YingLiu,

    I would strongly suggest you upgrade to GATK4 Mutect2. In GATK4, the functionalities are broken into two tools, Mutect2 and FilterMutectCalls. If you still want to understand what is going on in GATK3-MuTect2, then would you mind creating a snippet of the region from your BAM and uploading this to our FTP site? It's difficult to tell what is going on without IGV visualization. Instructions are in https://software.broadinstitute.org/gatk/guide/article?id=1894.

    @dayzcool, I know -newQual is recommended for the germline side (HaplotypeCaller and GenotypeGVCFs). I just saw that the argument is listed under parameters for Mutect2 and FilterMutectCalls and wanted to gather as much information as possible from @YingLiu. I'll have to consult our developer (currently on vacation) exactly how the somatic workflow uses this parameter. It may be that the somatic side does not use this option at all and our documentation needs to be more discerning.

    @shlee,
    I have use gatk4 ,and I found very big changes for that variants site
    chr3 12649448 . CAAAAAAA CAAAA,CAAAAA,CCAAAAAAAA
    GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
    0/1/2/3:89,14,9,22:0.147,0.148,0.215:32,32,31,32:0,0,0,0:225,310,246,258:60,60,60,60:30,44,39,26:0.051,0.061,0.063:5.112e-03,0.049,0.945

    there has big difference with gatk3 . so what happened ??
    gatk4 is also beta version ,so i prefer to use gatk3 . so could you please tell me the detailed reason for that ?

    thank you very much!

  • YingLiuYingLiu ChinaMember

    @shlee said:
    Hi @YingLiu,

    You are showing the allele fraction is lower than expected from the allele depths:

    AD-->139,66
    AF-->0.143
    

    In HaplotypeCaller we can expect AD to be lower than expected (see Article#6005). Mutect2 and HaplotypeCaller share functionalities. Before we get into your question, can you tell us if you are using GATK3 or GATK4 Mutect2 and the specific release version? If you are using GATK3, then please switch to using GATK4 Mutect2. Also, can you post the M2 command that you use? We need to know if you using --useNewAFCalculator or the original QUAL model. Finally, I suggest you study (or share here by posting) the BAMOUT for the locus in IGV. I give instructions on how to do this in section 6 of the original Mutect2 hands-on tutorial at https://drive.google.com/open?id=0BwTg3aXzGxEDdXRsY1hWdzU5TzQ presented in February of this year in Belgium. Note that the tutorial uses GATK3 Mutect2.

    @shlee
    I have checked your tutorial and do the step 6.1 and 6.2 ,i got IGV result ,however i do not know how to discrern what's good calls or bad calls .I have no feel about that indel calls result .
    could you please give me some detailed introduction about how to judge what's good calls or bad calls via IGV ??
    thank you !

    the result image links(FYI) :
    https://ibb.co/nLjJuv

    Issue · Github
    by shlee

    Issue Number
    2435
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin
    edited August 2017

    Hi @YingLiu,

    I've received your tarball and will look into this today. Thanks for the additional information.

    Actually, both gatk3-MuTect2 and gatk4-beta's Mutect2 are in BETA status. The additional beta label for GATK4 is about the engine, which affects tools across the program. Since both versions of M2 are in BETA status, it's better for you to use gatk4-Mutect2 as it contains improvements and we will continue to develop this version of the tool.

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin
    edited August 2017

    Hi @YingLiu,

    I've taken a quick look at the locus and there are a couple of things going on.

    We are still experiencing a bug in our forum on attaching images so here is the screenshot I took: https://www.dropbox.com/s/wtzbnumh6ywdozu/Screenshot%202017-08-21%2011.16.18.png?dl=0. I clean out my dropbox often so be sure to download this image if you want to refer to it again later.

    1. The call is at a site that is proceeded by 24 As. Haplotype assembly depends on a factor called kmer size. The default is set to 10 and 25. For this site it would be advantageous for you to specify a larger -kmerSize that can encompass the entirety of the string of As, e.g. 30 and higher. This will (i) reduce the number of artificial haplotypes and (ii) increase the number of informative reads for your samples, of which you currently have a rather small proportion for each sample. This should help to better define the size of the indel(s) for the site in both your samples.

    2. This site is in dbSNP and in the Mills indels resource. It is purportedly a site of common variation, so I'm not sure how interested you are in it for your somatic analysis, especially given your normal is also variant at this site. I have one of the latest snap-shot builds of IGV that show the number of bases inserted/deleted. If I sort on it, we see that both your normal and tumor have a range of insertions and deletions predicted around these sites in the reads.

    This site, given (i) it has a higher chance of experiencing artifactual expansions and contractions during PCR amplification and during sequencing-by-synthesis, (ii) is reported as a site of common variation in the Mills resource, which orthogonally verified their indels and (iii) presents as variant in your normal, I should think you be disinterested in this site. It has a high chance of being a germline variant and not a somatic variant. Downstream filtering, e.g. with gatk4-FilterMutectCalls, may help to filter this site.

    I think your analysis could benefit from use of a Panel of Normals (PoN). By using many normals sequenced the same way as your tumor and normal, you can see if such variation at your site is widespread across germline samples and filter these out with peace of mind. If not common, then you can decide on a course of action.

    As an aside, I'm noticing (according to RefSeq) this site in is the intronic region, well away from the exon boundaries. I can imagine that cis-elements within intronic regions could factor towards this gene's expression. Also, this gene locus is at a boundary, where the sense-strand switches. That is, for the gene at the immediately lower coordinate region we see it is expressed in the right-ward direction. The locus we are examining expresses in the leftward direction. Are these particular genes of great interest to your studies? If so, then I believe the new features of gatk4-Mutect2/FilterMutectCalls can provide finer grained analysis as the new M2 can be sensitive to alleles and I think can help distinguish between differently sized indels.


    As for your followup questions:

    Yes, there are big changes between gatk3-MuTect2 and gatk4-Mutect2. It appears that someone has modified the latest gatk4-Mutect2 tool documentation. However, if you look at the v4.beta.1 documentation that I wrote a while ago, I outline the key differences between gatk3-M2 and gatk4-M2. I have a section titled "How GATK4 Mutect2 differs from GATK3 MuTect2". Here are the links to the v4.beta.1 documents for Mutect2 and FilterMutectCalls.

    Be sure to also check the latest documentation, currently 4.beta.3, as our developers are actively changing the tools and their parameters.

    Again, the BETA label is used in two ways: at the program level and at the tool level. Both GATK3 program's MuTect2 tool and GATK4 program's Mutect2 tool are in BETA status. For the program itself, GATK4 is in BETA status. You are better off switching to the newer version of Mutect2.

    You ask for detailed instructions on how to judge good vs. bad calls. I know that manual review is vital to somatic analyses. However, this type of instruction is beyond the scope of what the GATK team offers. This is in part because review depends on the artifacts particular to your sequencing method, sample prep, cancer type (and expected mutation signatures) and preprocessing and analysis pipeline. My recommendation to you is that you take full advantage of the PoN and known variant resources to minimize the number of sites you need to manually review.

    I hope all of this is helpful.

  • YingLiuYingLiu ChinaMember

    @shlee, thank you very very very much !
    I will check gatk4 later .
    if you come to China ,let me know please ,I hope treat you to a dinner .
    I also have such the following these questions :
    1. about " presents as variant in your normal", how do you know that site appears at normal sample ,according to the output bam file from '--bamout' ?
    I have followed your tutorial https://drive.google.com/open?id=0BwTg3aXzGxEDdXRsY1hWdzU5TzQ to check that bam file ,and find that that bam have 3 types of smple : 'normal' ,'tumor' and 'HC" reads ,I do not know why will have three types ? I guess both 'normal' and 'tumor' reads are all from orignal bam,HC is from assembly reads .
    I found HC reads have 5 reads to supprot 'AAA' insertion , what does it means ? so what's the function of 'HC' reads ?
    2. about " is reported as a site of common variation in the Mills resource", is there the latest link to download GATK resource bundle including such as "dbSNP" ,"Mills.database files etc ?
    3. Any details or paper to introduce about "it has a higher chance of experiencing artifactual expansions and contractions during PCR amplification and during sequencing-by-synthesis" ?

    1. about "You ask for detailed instructions on how to judge good vs. bad calls" ,is there any paper to learning web site or e-mail group to introduce that ?
  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator admin
    edited August 2017

    Thanks @YingLiu for the offer for dinner! I hope some day to visit the middle kingdom and I will take you up on that offer.

    1. The BAMOUT you sent me contains reads from the normal sample. You can ask IGV to group by sample or readgroup (see http://software.broadinstitute.org/software/igv/PopupMenus). Each grouping is separated by a line in the window. The HC readgroup refer to the artificial haplotypes that M2 generates. They can be chimeric in that they may contain combinations of variants that do not make sense phasing-wise. If you color by HC tag, then you see which reassembled reads support which artificial haplotype as their colors should match.
    2. Whatever files we have in our resource bundle (https://software.broadinstitute.org/gatk/download/bundle) are the files we provide as-is. There should be a Mills resource for each of the reference assemblies.
    3. I'm sure there are many papers that note this artifact of PCR and sequencing (by synthesis). I just don't have any to cite for you. What pops into my mind is a paper our developers discussed at a journal club on short tandem repeats (https://www.ncbi.nlm.nih.gov/pubmed/22522390; DOI: 10.1101/gr.135780.111), which are somewhat different from your string of As.
    4. Unfortunately not in a detailed way. We are updating our materials to touch upon variant review for the GATK workshop, in the M2 handson tutorial. But the scope will be introductory. If you think this would be sufficient to get you started, GATK will be in China in a few weeks to give a special workshop (https://software.broadinstitute.org/gatk/events), but I'm not sure if it is open to the general community. Be sure to check our announcements and the link on the events page to see if it updates to a registration page. Otherwise, I recommend you try to gain experience by talking to an experienced genomicist.
  • YingLiuYingLiu ChinaMember

    @shlee said:
    Hi @YingLiu,

    I would strongly suggest you upgrade to GATK4 Mutect2. In GATK4, the functionalities are broken into two tools, Mutect2 and FilterMutectCalls. If you still want to understand what is going on in GATK3-MuTect2, then would you mind creating a snippet of the region from your BAM and uploading this to our FTP site? It's difficult to tell what is going on without IGV visualization. Instructions are in https://software.broadinstitute.org/gatk/guide/article?id=1894.

    @dayzcool, I know -newQual is recommended for the germline side (HaplotypeCaller and GenotypeGVCFs). I just saw that the argument is listed under parameters for Mutect2 and FilterMutectCalls and wanted to gather as much information as possible from @YingLiu. I'll have to consult our developer (currently on vacation) exactly how the somatic workflow uses this parameter. It may be that the somatic side does not use this option at all and our documentation needs to be more discerning.

    @shlee, my orignal post " HI,
    when I used Mutect2 to call mutations .
    I got this bug ,
    chr3 12649448 . C CCAAA GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:139,66:0.143:0:1:.:138,31:0:5 .

    66/(139+66) = 0.32, not 0.143.

    so what's wrong with Mutect2 ??"

    from this link https://software.broadinstitute.org/gatk/documentation/article.php?id=6005, I just know
    "the DP in the INFO field is unfiltered,the DP in the FORMAT field is filtered",

    Issue · Github
    by shlee

    Issue Number
    2468
    State
    closed
    Last Updated
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @YingLiu
    Hi,

    Sorry for the delay.

    It turns out the AF calculation takes into account the likelihoods of the reads supporting the alleles. From the developer:

    A few weeks ago we changed the code so that all annotations are based on the same filtered reads. However, AD is the number of reads that more likely than not support an allele. So if you have 10 reads, each with 0.6 probability of having a certain alt allele, you get an AD of 10, whereas you get essentially 0.6 x 10 = 6 reads for the purpose of AF.

    Have a look at this document for some more information.

    -Sheila

  • twelvesummertwelvesummer Member

    I am still confused in the AF calculation.
    @YingLiu 66/(139+66) = 0.32, not 0.143.
    And which one is the sample's mutation frequency 0.32 or 0.143?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @twelvesummer
    Hi,

    The way our tools calculate AF, they take into account the likelihoods of the reads supporting the alleles. Those likelihoods are not available in the VCF output, but I think you can have them output in an intermediate file (but, you would have to look into the code to do that).

    In this case, if you don't want to take the likelihoods into account, you can simply divide the AD values like you wrote above.

    -Sheila

Sign In or Register to comment.