Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

MuTect2 misses a deletion that is obvious in IGV

Michael7Michael7 GermanyMember
edited April 2016 in Ask the GATK team

Hi, first of all, thanks for GATK! :)

I am using MuTect2 and it works well overall. In one of our samples, however, we know that a deletion exists at 1:23559238. It is also clearly visible in IGV for the preprocessed input BAM file (296/296 reads have C and 70 DEL). Unfortunately, it is not detected by MuTect2.

In the --m2debug log, the best detected haplotype still contains and correctly discovers this deletion, annotated as [GC*, G] for 1:23559237:INFO 14:06:48,326 EventMap - === Best Haplotypes === INFO 14:06:48,326 EventMap - CTGCAGGTCGAGAATGTAGTCGATGACGCGCTGTAGGATTTCCACCTGGCAAAGTGAGTGCCTCTCGGGACTCCGGGTACCAGTTCCCGCAGGCGGGAGGAGCAGTGGTTCATGTCGTCCAGCAAGCTCAGCGGCTCCTCAGCTGCCGGGCCCTTCCCT INFO 14:06:48,326 EventMap - > Cigar = 54M1D105M INFO 14:06:48,327 EventMap - >> Events = EventMap{1:23559234-23559234 [T*, A],1:23559237-23559238 [GC*, G],1:23559284-23559284 [T*, G],}One base before the deletion, at 1:23559237, there is another variant (IGV: 262G+93A=355reads). The fourth-best haplotype contains this event:INFO 14:06:48,328 EventMap - CTGCAGGTCGAGAATGTAGTCGATGACGCGCTGTAGGATTTCCACCTGGCTAAACTGAGTGCCTCTCGGGACTCCGGGTACCAGTTCCCGCAGGCGGGAGGAGCAGTGGTTCATGTCGTCCAGCAAGCTCAGCGGCTCCTCAGCTGCCGGGCCCTTCCCT INFO 14:06:48,329 EventMap - > Cigar = 160M INFO 14:06:48,329 EventMap - >> Events = EventMap{1:23559237-23559237 [G*, A],1:23559284-23559284 [T*, G],}Genotyping at this position results in two allelic fractions and two TLODs that are string-concatenated in the log (looks a bit weird as there is no space or other separator between the numbers...):INFO 14:06:48,413 SomaticGenotypingEngine - Genotyping event at 23559237 with alleles = [GC*, G, AC] (...) INFO 14:06:48,493 SomaticGenotypingEngine - Calculated allelic fraction at 23559237 = 0.131147540983606560.15873015873015872 INFO 14:06:48,494 SomaticGenotypingEngine - Tumor LOD at 23559237 = 161.8743235881842157.0755211370182In the final VCF file, only the second TLOD=157 is output at 23559237; the TLOD=161 (presumably the deletion) is lost:
1 23559007 rs11574 T C . PASS DB;ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=0.00;TLOD=1161.75 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:11,497:0.975:253:244:0.509:340,16766:1:10 1 23559234 . T A . PASS ECNT=2;HCNT=3;MAX_ED=3;MIN_ED=3;NLOD=0.00;TLOD=167.66 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:270,72:0.211:40:32:0.556:8577,2288:133:137 1 23559237 . G A . clustered_events;triallelic_site ECNT=2;HCNT=3;MAX_ED=3;MIN_ED=3;NLOD=0.00;TLOD=157.08 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:180,96:0.159:43:53:.:6450,3008:93:87I already tried --artifact_detection_mode to exclude any filter problems. Maybe it is a problem that the deletion at 23559238 is anchored at 23559237 and another variant is output at this position? (Is there a parameter to allow more than one VCF entry per start base pair?)

Tagged:

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Michael7
    Hi,

    First, have a look at this page that should help. The HaplotypeCaller recommendations apply to MuTect2 as well :smile:

    -Sheila

  • Michael7Michael7 GermanyMember

    Hi Sheila, thanks for your response. After reading the page you linked, I created a --bamOutput and looked at it in IGV (screenshot is attached). You can clearly see the deletion in the input BAM (upper panel) and in the output BAM (bottom panel). [Side note: I do not understand, why IGV shows more reads for the output BAM, maybe unmarked MarkedDuplicates...?] I also tried "--max_alternate_alleles 50" now for the interval 1:23558000-23560000, but this did not help either.

    The log qualitatively looks still the same to me, as described in my first post. In case it is of use to you, I have attached the BAM output and the log files.

    Thanks for your help, Michael.

    PS: For this test, I have used the nightly version from today (2016-04-25).
    PPS: We are certain that this deletion exists in the sample (Sanger validated).

  • Michael7Michael7 GermanyMember

    I have now downloaded your source code from November 2015 via https://github.com/broadgsa/gatk-protected.git and looked into it. (I didn't find any more current source code; is it correct?)

    I started in SomaticGenotypingEngine.java, after it outputs Tumor LOD at to the log. (This log line still seems to contain the deletion; see my first post.) There, more precisely below the comment //reconcile multiple alts, if applicable, it iterates through all detected alternate alleles and sets lodInd to the highest index of an alternate allele that passes the INITIAL_TUMOR_LOD_THRESHOLD. A few lines below that, only this alternate allele seems to be used to create the call; the line reads: tumorAlleles.add(mergedVC.getAlternateAllele(lodInd));. All alternate alleles for the same start position with an index <lodInd seem to be lost. In my case, the lower lodIndex (presumably, the missing deletion) even has a higher tumor LOD than the allele used in the call (TLOD=161.87 versus TLOD=157.07; see first post). I think that the call generation code should iterate over all qualified alternate alleles for the current start position, or am I missing something?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Michael7, that is the correct code. Our development repository is private, we release new code whenever we make a new version (one is coming in a few weeks but there are no significant changes in MuTect2).

    I think your reading of the code is correct. Would you be able to share a snippet of test data for us to review this locally? Instructions are here: https://www.broadinstitute.org/gatk/guide/article?id=1894

  • Michael7Michael7 GermanyMember

    Hi Geraldine, I have uploaded all requested files to your FTP server (file name: 2016-04-27=bugReport=MuTect2=missedVariantsBecauseOfSameStartPos.zip; the file is 881MB, as I had to included the reference FASTA; I am using the Ensembl hg38 version from ftp://ftp.ensembl.org/pub/release-81/fasta/homo_sapiens/dna).

    I think the easiest fix would be to cut all code between line 278 (final double tumorLod = tumorLods[lodInd];) and line 358 (returnCalls.add( annotatedCall );) and paste it below line 275 (lodInd = altInd;) to make use of the existing for loop for altInd. (Then, we could also use altInd directly, remove the lodInd variable and consolidate the two redundant >= MTAC.INITIAL_*_LOD_THRESHOLD checks.) What do you think?

    Thanks for your help and your time, Michael.

    Issue · Github
    by Sheila

    Issue Number
    839
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Michael7
    Hi Michael,

    I just put in a bug report. You can keep track of the status here.

    -Sheila

  • Michael7Michael7 GermanyMember
    edited June 2016

    Hi Sheila, thank you very much for reproducing the missed deletion in IGV and for the bug report. I read it again today and found a copy&paste mistake in it; maybe you can correct it? (Details: The copied sentence from my third post above starts with "All alternate alleles for the same start position with an index..." and should continue with "....<lodInd seem to be lost". This pinpoints the main cause in the code, I think. However, the cited sentence continues with ".... = MTAC.INITIAL_*_LOD_THRESHOLD checks" from my fourth post instead, and every info in-between is skipped.)

    We are currently waiting with the DNA-seq analysis of about 100 samples for which we validated our analysis pipeline based on previous Sanger results for three genes. Besides this missed deletion, all Sanger variants were rediscovered by the pipeline and by MuTect2. However, since this bug seems to systematically lose variants that start at the same base pair position, we decided to wait before we apply the pipeline to all measured genes. Based on your experience, how long will this bug fix still take? Should we continue to wait? I mean, how severe is this bug in your estimation? Many thanks, Michael.

    Post edited by Michael7 on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Michael7
    Hi Michael,

    No worries about the bug report. The developers usually look at the thread on the forum, so they will see all of your insight :smile:

    I just pinged the developer assigned to the bug, so when she gets back to me, I will let you know.

    -Sheila

  • Michael7Michael7 GermanyMember

    Okay, thanks. I just saw your new GATK v3.6 source code release; thanks for that, too! Unfortunately, the deletion is still missed with this new version...

    I have setup Maven/IntelliJ for it today. Next, I changed SomaticGenotypeCaller.java as indicated above and recompiled with mvn -verify. Now the deletion makes it into the VCF output.:smile: Any qualifying variant at the same position should be reported now...

    I have attached the new code file, full debug logs and VCF results, for both the fix and the original GATK v3.6 code. Maybe you find it useful. [Details: All code changes are tagged with "//MIC". I used artifact detection mode for debugging, but can confirm that it also detects the deletion without it (and without debugging flags). However, I couldn't test those parts of the code that are only active if you input paired normal samples; you might want to check that there are no unwanted side effects from the bugfix in this scenario...]

    Best, Michael.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Michael7
    Hi Michael,

    Thank you for submitting the fix and making sure it works! Actually, we do have a process for submitting a patch to the GATK codebase. Instructions are here. If you could follow those instructions, I will make sure someone reviews and merges it in soon.

    Thanks,
    Sheila

  • Michael7Michael7 GermanyMember

    Hi Sheila,
    okay, great. I have created a git patch and mailed it to gsa-patches at broadinstitute.org, according to the linked instructions. To support reviewing, I minimized code changes (no cleanup, just the fix; details are in the commit description). I recompiled this patched version and tested it again; it detects the deletion and works as expected, as far as I can tell.
    Thanks for reviewing,
    Michael.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Michael7, the developers are reviewing your report and testing possible solutions, including your suggested fix. As part of that they'd like to know if it's okay for them to use the data you provided in some integration tests. It would remain in our private codebase and would not be redistributed. Would that be alright, or is the data subject to usage restrictions?

  • Michael7Michael7 GermanyMember

    Great, thanks. Regarding data usage, I have checked back and, yes, you may use this anonymized data snippet around the deletion for integration tests in your internal codebase. :)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Michael7, this has been addressed in the port of MuTect2 to GATK4 (sadly not in today's 3.7 release afaik). We'll post a roadmap for MuTect2 development/maturity in the near future.

Sign In or Register to comment.