insertions like AAAAAA/GGGGGG/CCCCCC/TTTTTT by Mutect2

jh2663, NYC, Member
edited April 24 in Ask the GATK team

I recently got very frustrating results from GATK 4.0 Mutect2 for some of my tumor samples. It reported a large number of insertions like AAAAAA / GGGGGG / CCCCCC / TTTTTT, most of which seem to be artifacts. Their VAFs are generally low, but they are still called as significant at the final step, and many of them are even located in exonic regions. I am lost at this point, so I would like to hear about any similar experiences or possible explanations.

  • Again, this is happening only in a subset of samples that went through the same pipeline together. Also, I don't see any noticeable insertions when I manually inspect the original BAM files at each location.

This example shows paired reads in the bamout (upper panel) and in the original BAM file (bottom panel). The dark reads are the same read, which carries the 'AAAAAA' insertion in one mate of the pair.
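
To see how many calls fit the pattern described above, one option is to list insertions whose inserted bases are a single repeated base. The following is a minimal sketch, not part of the original post: the function name `homopolymer_insertions`, the default minimum length of 6, and the assumption of a plain-text VCF with one ALT allele per record are all illustrative choices.

```shell
#!/usr/bin/env bash
# List insertions whose inserted bases are one repeated base (e.g. AAAAAA).
# Usage: homopolymer_insertions calls.vcf [min_len]
# Assumption: plain-text VCF, one ALT allele per record (hypothetical input).
homopolymer_insertions() {
  local vcf="$1" min_len="${2:-6}"
  awk -v min="$min_len" 'BEGIN { FS = OFS = "\t" }
    /^#/ { next }                                   # skip header lines
    {
      ref = $4; alt = $5
      if (length(alt) <= length(ref)) next          # not an insertion
      if (substr(alt, 1, length(ref)) != ref) next  # only simple left-anchored insertions
      ins = substr(alt, length(ref) + 1)            # the inserted bases
      if (length(ins) < min) next
      first = substr(ins, 1, 1)
      rest = ins
      gsub(first, "", rest)                         # remove every copy of the first base
      if (rest == "") print $1, $2, ref, alt, $7    # nothing left: all bases identical
    }' "$vcf"
}
```

Calling `homopolymer_insertions filtered.vcf` (file name assumed) prints CHROM, POS, REF, ALT and FILTER for each matching record, which makes it easy to check whether these calls pass filtering.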

Best,


Answers

  • jh2663, NYC, Member
    edited April 24


    I noticed another point: all of these polyX insertions are located at the 3' end of the reads. So, is there any way in Mutect2 to discard all variants found at the 3' end of only forward or only reverse strands? Otherwise, I will try 3'-end trimming with ClipReads before running Mutect2.

    Any suggestions would help. Thanks.

  • bhanuGandham, Cambridge MA, Member, Administrator, Broadie, Moderator

    Hi @jh2663

    Please try with the latest GATK4.1.2.0 and let us know if the error persists.

  • jh2663, NYC, Member
    edited April 25

    Thanks. I ran GATK4.1.2.0 with the --dont-use-soft-clipped-bases (true) option and found that most of the polyX artifacts have now disappeared. I didn't try GATK4.1.1.0 with the same option, but I believe this is a matter of the soft-clipped bases option rather than the GATK version.

    But again, this kind of artifact was seen only in a subset of my BAM files under the same pipeline. So I think some of my sequencing data might have more polyX bases at the end of each read for some unknown reason. As I understand it, most of those polyX bases might already be soft-clipped in the BAMs due to their lower base quality, but were nevertheless used by Mutect2.
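
For readers who want to reproduce this comparison, a rerun with soft-clipped bases ignored might look roughly like the sketch below. The file names, the `chr1` interval, and the tumor-only invocation are placeholders rather than details from this thread; a matched normal would need its own `-I` and `-normal` arguments.

```shell
# Sketch: rerun Mutect2 while ignoring soft-clipped bases.
# All file names and the interval are placeholders, not taken from the thread.
mutect2_cmd=(
  gatk Mutect2
  -R ref.fasta
  -I tumor.bam
  -L chr1
  --dont-use-soft-clipped-bases true
  -bamout tumor.chr1.bamout.bam
  -O tumor.chr1.unfiltered.vcf
)
# Print the command for review; set RUN=1 to actually execute it.
printf '%s ' "${mutect2_cmd[@]}"; echo
if [ "${RUN:-0}" = "1" ]; then "${mutect2_cmd[@]}"; fi
```

Keeping the `-bamout` output makes it possible to check in IGV whether the reassembled reads still carry the polyX insertions.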

    BTW, I also see that the total number of mutations, including SNPs and small INDELs, decreased by ~10%, even in samples that didn't show polyX artifacts. Do you think this might be a concern? To my understanding, soft-clipped bases mainly help to infer long INDELs, which are not my main interest. Any other thoughts?
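
One way to look into the ~10% drop is to count records in both callsets and list the calls that disappeared. This is only a rough sketch under the assumption of plain-text VCFs; the helper names `count_records`, `percent_change`, and `lost_records` are made up for illustration, and the comparison does not by itself say whether the lost calls were real.

```shell
# Count variant records (non-header lines) in a plain-text VCF.
count_records() { grep -vc '^#' "$1"; }

# Percent change from "before" to "after", one decimal place.
percent_change() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (a - b) * 100 / b }'
}

# CHROM, POS, REF, ALT of each record, sorted so the lists can be compared.
variant_keys() { grep -v '^#' "$1" | cut -f1,2,4,5 | sort; }

# Records present in the first VCF but missing from the second.
lost_records() { comm -23 <(variant_keys "$1") <(variant_keys "$2"); }
```

For example, `percent_change "$(count_records before.vcf)" "$(count_records after.vcf)"` reports the relative change, and `lost_records before.vcf after.vcf` lists the sites to inspect (both file names assumed).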

    Best

    [Screenshots: before applying the option ==> after applying the option]

  • davidben, Boston, Member, Broadie, Dev ✭✭✭

    @jh2663 There seems to be an issue with some of your samples, but we often like to make Mutect2 smart enough to deal with artifacts like this automatically. Could you post screenshots of 1) a few of these homopolymer insertions in the output of FilterMutectCalls and 2) the .filteringStats.tsv output of FilterMutectCalls?

  • jh2663, NYC, Member

    Is this what you need? Thanks.

  • davidben, Boston, Member, Broadie, Dev ✭✭✭

    @jh2663 This is helpful but I need to ask more. When you see a polyX insertion artifact what does the reference look like for the 20 or so bases before and after the insertion? I'm wondering whether there is a polyX in the reference. You could do one of three things: 1) show the reference track in your IGV screenshot of the Mutect2 bamout, 2) paste the reference bases around a few artifacts, or 3) run Mutect2 on a few of these sites with -A ReferenceBases.
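
Options 2 and 3 above could be sketched as follows. The helper name `flank_region`, the example coordinates, and the file names are hypothetical; the only things taken from the post are the 20-base window and the `-A ReferenceBases` annotation.

```shell
# Print a region covering `flank` bases on each side of a 1-based position.
# Usage: flank_region CHROM POS [FLANK]   (FLANK defaults to 20)
flank_region() {
  local chrom="$1" pos="$2" flank="${3:-20}"
  local start=$(( pos - flank ))
  if (( start < 1 )); then start=1; fi   # clamp at the start of the contig
  echo "${chrom}:${start}-$(( pos + flank ))"
}

# Option 2 (placeholder site): print the reference bases around an artifact.
#   samtools faidx ref.fasta "$(flank_region chr1 1234567)"
# Option 3 (placeholder site): rerun Mutect2 on one site with the annotation.
#   gatk Mutect2 -R ref.fasta -I tumor.bam -L chr1:1234567 \
#     -A ReferenceBases -O site.vcf
```

Either output should show quickly whether the insertion sits next to a homopolymer run in the reference.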

    By the way [unrelated to the artifact] are you running FilterMutectCalls on a smaller interval than you ran Mutect2 on? FilterMutectCalls needs to run on the entire Mutect2 callset; otherwise the somatic clustering model won't work correctly.

  • jh2663, NYC, Member

    OK. I will do it later for the first point since I need to rerun Mutect2.

    For the 2nd question, I ran Mutect2 and then FilterMutectCalls with the same interval (each chromosome).

    Best,

  • davidben, Boston, Member, Broadie, Dev ✭✭✭

    > For the 2nd question, I ran Mutect2 and then FilterMutectCalls with the same interval (each chromosome).

    You will get better results if you combine chromosomes:

    gatk MergeVcfs -I chr1.vcf -I chr2.vcf . . . -I chrX.vcf -O unfiltered.vcf
    
    gatk MergeMutectStats -stats chr1.vcf.stats -stats chr2.vcf.stats . . . -stats chrX.vcf.stats -O unfiltered.stats
    
    gatk FilterMutectCalls -R ref.fasta -V unfiltered.vcf -stats unfiltered.stats -O filtered.vcf
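
Typing one `-I` and one `-stats` argument per chromosome by hand is error-prone, so the argument lists can be generated instead. This is a minimal sketch: the helper `per_chrom_args` and the short `chroms` list are assumptions for illustration, and file names are assumed to contain no spaces. The commands are only printed, so they can be reviewed before being run.

```shell
# Build repeated "<flag> <chrom><suffix>" arguments, one pair per chromosome.
# Usage: per_chrom_args FLAG SUFFIX CHROM...
per_chrom_args() {
  local flag="$1" suffix="$2"; shift 2
  local c
  for c in "$@"; do printf '%s %s%s ' "$flag" "$c" "$suffix"; done
}

chroms=(chr1 chr2 chrX)  # placeholder: list every chromosome that was run

# Print the three commands (drop "echo" to execute them).
echo gatk MergeVcfs $(per_chrom_args -I .vcf "${chroms[@]}") -O unfiltered.vcf
echo gatk MergeMutectStats $(per_chrom_args -stats .vcf.stats "${chroms[@]}") -O unfiltered.stats
# The merged stats file is passed explicitly because its name does not follow
# the default "<input vcf>.stats" pattern that FilterMutectCalls looks for.
echo gatk FilterMutectCalls -R ref.fasta -V unfiltered.vcf -stats unfiltered.stats -O filtered.vcf
```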
    
  • jh2663, NYC, Member

    You mean I should run Mutect2 on each chromosome and then combine all the VCFs for FilterMutectCalls? Could you explain more about why this gives better results?

    Anyway, I will try all of your suggestions. Thank you so much!!

  • davidben, Boston, Member, Broadie, Dev ✭✭✭

    It gets better because as of GATK 4.1.1 FilterMutectCalls learns a model of somatic allele fraction, including subclonal clusters and overall mutation rates. It's more powerful if it can learn over all data simultaneously. The details are described in the Mutect2 documentation: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf.

    Just to be clear, any one of suggestions 1-3 will give enough information to start hypothesizing. You don't need to do all three.
