We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Mutect2 repeatedly not detecting somatic variant IDH2 R172K, with solid read support and 5% AF

mack812mack812 SpainMember
edited October 2019 in Ask the GATK team

Hi,

I am currently validating a GATK analysis pipeline applied to data from amplicons and deep coverage (>1000x) for detection of somatic variants at hotspot sites in cancer samples. As part of this validation I have included reference samples from a commercial provider, which harbour many relevant variants at defined allelic frequencies. The result of the validation so far is quite satisfactory overall, especially for a limit of detection of 5% (FYI 1% AF-LOD is worse, and detection is highly dependent on depth of coverage, requiring more supporting reads than I expected). However, there is a single variant (IDH2 R172K) that goes consistently undetected by Mutect2, in two independent runs of this sample. The variant is present at 5% allelic frequency and was covered at 2000-3000x (100-150 supporting reads) and therefore was not expected at all to cause any troubles to Mutect2. The variant can be easily seen in IGV and has very deep coverage in the bamout (hence it is not a problem of an active region not being triggered at this site or of reads being filtered out by the tool).

To make this issue even more difficult to understand, another variant less than 15 bp away was correctly detected. Moreover, this same variant (IDH2 p.R172K) was detected in a different sample at an even lower AF (1%). So, it seems that Mutect2 is unable to detect the variant IDH2 p.R172K in the specific sequence context present at this sample (as said before, this happened twice: in two independent runs).

I am using GATK 4.1.3.0 and the following Mutect2 command:

gatk --java-options "-Xmx4000m" \
  Mutect2 \
    -R $ref_fasta \
    -I $inpath/$filename \
    -germline-resource $gnomeAD_af \
    -L $interval_hotspots \
    --max-reads-per-alignment-start 0 \
    --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
    --disable-read-filter NotDuplicateReadFilter \
    -O $vcf_outpath/${basename}.unfiltered.vcf \
    -bamout $vcf_outpath/${basename}.bamout.bam

I am concerned that this could be happening at other regions, potentially increasing the rate of false negatives, so I would appreciate if you could help me identify the underlying problem here.

Best Answers

  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MA admin
    Accepted Answer

    Hi @mack812 , I will write a more detailed response to you tomorrow, but wanted to let you know that we found out by running in
    --mitochondria-mode , the variant was recovered. Can you give that a try?

  • davidbendavidben Boston ✭✭✭
    Accepted Answer

    @mack812 What's going on under the hood is as follows: when M2 initially tries to assemble with a kmer size of 10, there happens to be a few stray reads with an error that induces a cycle in the graph. That is, there is, say, a kmer AAACCCTTTG toward the end of the amplicon and a kmer AAAGCCTTTG toward the beginning. A single base error in the former can induce a path at the end of the amplicon to jump to the beginning.

    When the assembly engine finds a cycle, it gives up and tries a larger kmer, by default 25. This is fine (there is a 10-base pseudo-homology but not a 25-base one), but because the variant occurs less than 25 bases from the start of the amplicon it ends up on a dangling end of the graph. We have some code to handle this but it's a bit sloppy and incomplete, to be honest, so the variant is missed.

    We are working on improvements to the assembly engine that will let it handle cycles in the graph (using the linked de Bruijn graph structure that our colleague Kiran Garimella co-authored: https://academic.oup.com/bioinformatics/article/34/15/2556/4938484), and are also improving dangling end recovery. Until these efforts mature, however, mitochondria mode is probably the best and most principled fix.

    Since you want more details, I'll first describe a couple of hackier solutions. One was to hope that a 14-mer would not induces cycles in the graph. The variant is 15 bases from the amplicon edge, so this is the biggest kmer you can get without inducing a dangling end. Indeed it does not, so --kmer-size 14 solves the problem. The second solution is to use default downsampling settings, which reduces your amplicon coverage greatly, of course. Purely by luck this ends up removing enough reads with the cycle-inducing error and assembly with 10-mers works.

    The reason mitochondria mode works is because it is better at recovery dangling ends at high depth. Assembly with 10-mers still fails and 25-mers still put the variant on a dangling end, but in mitochondria mode we're able to grab the variant from the dangling end. At lower depths this creates a handful of weird false positives so we can't recommend always turning on mitochondria mode. However, I think it's safe to recommend mitochondria mode as the best practice for amplicon sequencing.

Answers

  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin

    Hi @mack812, one thought is that the variant could be getting filtered out by the germline resource. However, I did a quick search for the variant in the gnomad browser, and nothing came up.

    Can you provide a sliver of the input bam (1000 bp on each side of this locus) so that we can debug? Here are instructions on how to provide us files.

  • mack812mack812 SpainMember

    Hi @Tiffany_at_Broad, thanks for your reply.

    I have uploaded the file mack81_IDH2-R172K_not-detected_issue.zip following the instructions. Please let me know if you need any additional information/files.

    The coordinates of the variant not being detected (hg19):
    chr15:90631838 C>T

  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin

    Thanks @mack812 ! The developer I am debugging with is out of the office most of this week, so we will try to get back to you by the end of the week!

  • mack812mack812 SpainMember

    Hi @Tiffany_at_Broad, thanks for your reply. Did the developer take a look at this already?

  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin

    Can you verify if this is happening for this coordinate with GATK 4.1.4.0?

  • mack812mack812 SpainMember

    Hi @Tiffany_at_Broad, I just checked with GATK 4.1.4.0 and same result: the variant was not detected. All other confirmed variants in this reference sample were detected with GATK 4.1.4.0.

  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin

    Thanks @mack812 . I checked with the team and it is on our backlog of things to investigate. Thanks for reporting it and we'll report back when we know more.

  • Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin
    Accepted Answer

    Hi @mack812 , I will write a more detailed response to you tomorrow, but wanted to let you know that we found out by running in
    --mitochondria-mode , the variant was recovered. Can you give that a try?

  • mack812mack812 SpainMember
    edited December 2019

    Hi @Tiffany_at_Broad,

    Thanks for your reply.

    This is funny: I found out the same thing just some days ago. At the end there were several other variants like this one (ignored by M2 even if present at high AFs) in my validation dataset. Exploring the bamout, they were being left out of the haplotypes built by M2 for no apparent reason. I need to finish the re-analisis to be completely sure but so far all of them are being correctly called now, after adding the mitochondria mode option. I tried many other things but this was the only one able to solve this issue. I guess it has to do with calling variants in very deeply covered regions, which is true for both mitochondria and for my data. Anyway I would love to hear a more detailed explanation.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭
    Accepted Answer

    @mack812 What's going on under the hood is as follows: when M2 initially tries to assemble with a kmer size of 10, there happens to be a few stray reads with an error that induces a cycle in the graph. That is, there is, say, a kmer AAACCCTTTG toward the end of the amplicon and a kmer AAAGCCTTTG toward the beginning. A single base error in the former can induce a path at the end of the amplicon to jump to the beginning.

    When the assembly engine finds a cycle, it gives up and tries a larger kmer, by default 25. This is fine (there is a 10-base pseudo-homology but not a 25-base one), but because the variant occurs less than 25 bases from the start of the amplicon it ends up on a dangling end of the graph. We have some code to handle this but it's a bit sloppy and incomplete, to be honest, so the variant is missed.

    We are working on improvements to the assembly engine that will let it handle cycles in the graph (using the linked de Bruijn graph structure that our colleague Kiran Garimella co-authored: https://academic.oup.com/bioinformatics/article/34/15/2556/4938484), and are also improving dangling end recovery. Until these efforts mature, however, mitochondria mode is probably the best and most principled fix.

    Since you want more details, I'll first describe a couple of hackier solutions. One was to hope that a 14-mer would not induces cycles in the graph. The variant is 15 bases from the amplicon edge, so this is the biggest kmer you can get without inducing a dangling end. Indeed it does not, so --kmer-size 14 solves the problem. The second solution is to use default downsampling settings, which reduces your amplicon coverage greatly, of course. Purely by luck this ends up removing enough reads with the cycle-inducing error and assembly with 10-mers works.

    The reason mitochondria mode works is because it is better at recovery dangling ends at high depth. Assembly with 10-mers still fails and 25-mers still put the variant on a dangling end, but in mitochondria mode we're able to grab the variant from the dangling end. At lower depths this creates a handful of weird false positives so we can't recommend always turning on mitochondria mode. However, I think it's safe to recommend mitochondria mode as the best practice for amplicon sequencing.

  • mack812mack812 SpainMember
    edited December 2019

    Hi @davidben,

    Thanks so much for such a detailed explanation. If I understood correctly, this issue is probably restricted to amplicon enrichment, due to the low diversity of start-end positions inherent to it. Most hybridization capture protocols start with physical fragmentation or tagmentation of DNA, which create a much higher diversity of read ends. I guess that prevents the graph cycle problem from happening, right?

    Also thank you for confirming that mitochondria mode is the current way to go with amplicons. If by any chance other examples of this could be useful to your development work on this issue (other variants missed in amplicons when running M2 w/o mt-mode), I will be glad to share the BAM snippets with you.

    At lower depths this creates a handful of weird false positives so we can't recommend always turning on mitochondria mode. However, I think it's safe to recommend mitochondria mode as the best practice for amplicon sequencing.

    Now that you bring that up, I am in fact observing a very weird situation (though not at low depths): a pretty curious set of false positives happening always towards the end of one specific amplicon where "ghost variants" are called, not always at the exact same site but in a very narrow range of postions. M2 calls variants there with more than 5% AF and dozens/hundreds of supporting reads (i.e. alt allele AD) but when inspected on IGV I only find a few reads supporting these "ghost variants". Being the region so well delimited, I am not that worried about it (they affect a non-hotspot region plus it will be easy to keep it under control), so mt-mode is still worth it. I will post this issue on a separate thread once I have a clearer idea of its extension.

    Post edited by mack812 on
Sign In or Register to comment.