Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

Mutect2 pipeline fails for some inputs

I'm running a WGS analysis and parallelizing the run for each chromosome. Few chromosomes are failing the somatic variant calling process with the following error. It is really difficult to pinpoint what the problem is, because most of the chromosomes are processed correctly to the end (20 out of 24). I'm guessing there is some integer vs. floating point conversion error. For now, I would really appreciate if you could tell me how to get rid of this issue...!

I think this is really something you should fix.

I'm running Mutect2 in a docker container: GATK jar /gatk/gatk-package-4.0.11.0-local.jar

...
15:39:59.895 INFO  ProgressMeter -        chr2:89909854            218.8                770780           3523.1
15:40:09.967 INFO  ProgressMeter -        chr2:90031053            218.9                771760           3524.9
15:40:20.003 INFO  ProgressMeter -        chr2:90125567            219.1                772540           3525.7
15:40:30.146 INFO  ProgressMeter -        chr2:90285632            219.3                773610           3527.9
15:40:43.457 INFO  ProgressMeter -        chr2:90296305            219.5                773680           3524.7
15:40:53.596 INFO  ProgressMeter -        chr2:90357855            219.7                774100           3523.9
15:40:59.892 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 11.926295505
15:40:59.892 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 1194.2077419680002
15:40:59.892 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 2637.29 sec
INFO    2018-10-24 15:41:01     SortingCollection       Creating merging iterator from 8 files
15:41:13.039 INFO  Mutect2 - Shutting down engine
[October 24, 2018 3:41:13 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 220.03 minutes.
Runtime.totalMemory()=12962496512
java.lang.NumberFormatException: For input string: "."
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.lang.Double.valueOf(Double.java:502)
        at htsjdk.variant.variantcontext.CommonInfo.lambda$getAttributeAsDoubleList$2(CommonInfo.java:299)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.Collections$2.tryAdvance(Collections.java:4717)
        at java.util.Collections$2.forEachRemaining(Collections.java:4725)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsList(CommonInfo.java:273)
        at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsDoubleList(CommonInfo.java:293)
        at htsjdk.variant.variantcontext.VariantContext.getAttributeAsDoubleList(VariantContext.java:740)
        at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getGermlineAltAlleleFrequencies(GermlineProbabilityCalculator.java:49)
        at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getPopulationAFAnnotation(GermlineProbabilityCalculator.java:27)
        at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.callMutations(SomaticGenotypingEngine.java:155)
        at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:221)
        at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:230)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:291)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/gatk-package-4.0.11.0-local.jar

Best Answers

  • bhanuGandhambhanuGandham admin
    Accepted Answer

    Hi @registered_user

    The FTP issue has been resolved. You can now send us your bug report by following the instructions provided here.

    Regards
    Bhanu

Answers

  • As you can see Mutect2 processing is interrupted in the middle of chromosome 2. I really don't know if it's some temporary file during processing that is causing the issue.

  • EADGEADG KielMember ✭✭✭

    Hi @registered_user

    Have you run Picard ValidateSamFile on your input bamfile? If its fine...do you try another version of Mutect2 (4.X.X/3.6/3.7)?

    Greetings EADG

  • registered_userregistered_user Member
    edited October 26

    It looks like this bug arises (possibly) when there is a mutation that is present in the tumor sample and the population resource file, but not in the normal sample. I have prepared a bug report about this issue, but the FTP login credentials are not working.

  • Thanks for the suggestion. Yes, I validated my bam files. There were no errors there. I'm reluctant to go back to older versions because this is clearly a bug with Mutect2 and I have spent a lot of time setting up parallelization for version 4. If I understand correctly there is a quite a bit of changes between versions 3 and 4.

  • EADGEADG KielMember ✭✭✭

    Hi @registered_user ,

    yes there a some minor change in the version like the -T to choose the tool, be aware off to use the -ntcoption to speed up the analysis in GATK3.X there a some glitches in (HaplotypeCaller & Mutect2) with this options which were never solved. Another advise is to stick to version 3.7 it is the most stable version of the latest GATK3.X version (in my opinion). To gain some speed you can you use gather and scatter over mutiple regions of your samples.

    To upload a bug-report @Sheila or @Geraldine_VdAuwera or some other of the support team had to create an extra login for you on the ftp-server, the public server is readonly and limited to 25 users at once.

    Greeting EADG

  • Looks like Sheila has left the forum software.broadinstitute.org/gatk/blog?id=12887 They are not making submitting a report very easy at all.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @registered_user

    Sorry for the inconvenience. We are looking into this.
    I am taking over for sheila and sorry about dropping the ball on this one, I am new to the forum and still ramping up. But bear with us a little longer and we will help you find a solution. :)

    Regards
    Bhanu

  • registered_userregistered_user Member
    edited October 30

    @bhanuGandham
    Great, thanks. Let me know if someone is interested in receiving the bug report and files to reproduce the error that I have prepared. The FTP login credentials for submitting a report are not working.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    @registered_user

    I have asked the IT team to look into it. I will get back to you soon. We are interested in receiving the bug report.

    Regards
    Bhanu

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin
    Accepted Answer

    Hi @registered_user

    The FTP issue has been resolved. You can now send us your bug report by following the instructions provided here.

    Regards
    Bhanu

  • @bhanuGandham
    I've uploaded my bug report with NumberFormatException and Mutect2 in the filename. I hope someone will get a chance to look at it soon. Thanks for sorting the FTP server issue out.

  • Thanks for taking a look @shlee :)

    I'm using gnomAD data as my population germline variant resource. I only included the problematic line in the bug report, but the data is full of rows with "AF=." and a decimal value in "AF_raw" that do not cause any problems.

    I'm just noticing the FILTER column and what the different values mean. Does the data even need contain these rows if they are only to be filtered out? If you are saying that all rows with AF=. (i.e. AF=0.0) are completely ignored anyway, I agree that my best option is to get rid of them before running Mutect2.

    I'm thinking you should still do something about this issue, because like I mentioned this bug arises in only 4/24 of my WGS sample chromosomes after hours of processing. The error message is not very helpful at all, and the data is full of identical rows that are processed without problems. I'm thinking I'm not the only one who has trusted that Mutect2 can process the gnomAD files without any filtering or undocumented preprocessing. Going through the steps in preparing my files I can see that the liftover algorithm I used for the gnomAD data (hg37->38) has changed the AF values from "AF=0.00000e+00" to "AF=.". The fact that the algorithm has altered these values, makes no sense at all.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @registered_user,

    The order in which Mutect2 goes through the data allows for sites that are AF=. to not trigger the error. For example, if the tumor case sample is reference at such a location or if the site is present in the PoN.

    There is a difference between AF=0 and an absence of a record for that site as the absence of the record triggers Mutect2 to use --af-of-alleles-not-in-resource, typically set to a fractional value, which is different from zero. Let me ask you, how can you be absolutely certain that the entirety of the world human population does not present the variant for this site, which AF=0 implies? My guess is that you would not want such a case represented in the population resource file you present to Mutect2.

    Here is another consideration. What exactly do these filtered sites (zero allele count or failing random forest filters) represent (what is AF_raw?)? Could they represent sequencing or reference artifacts? If so, then they, in fact, belong in the PoN.

    Let's also consider a technical question. The resource file you present is a valid VCF, following VCF conventions. It is a data file we may want to parse for the genotypes it presents for a certain population subgroup, say e.g. using SelectVariants. It is also a resource file that you use with Mutect2 to represent commonly variant sites and their alleles. Towards this latter use, from a pipelining and production standpoint, if we expect the resource to be used against every case we will analyze, then it makes sense to clean up the data, e.g. make a sites-only VCF and to remove filtered sites. This allows every instance the resource is used to be more efficient. While we are at it, we may consider a cutoff for allele frequency, i.e. remove those variants that only present once or more if we suspect the data represents closely related individuals. For the Mutect2 workflow, such rare variant sites are much much better represented by the matched normal anyways.

    Our Mutect2 developer preprocessed the gnomAD GRCh37 resource, following the procedure outlined at https://github.com/broadinstitute/gatk/blob/4.0.11.0/scripts/mutect2_wdl/mutect_resources.wdl. In this WDL script, we see the data is subset to common biallelic SNPs. The workflow applies a minimum allele frequency cutoff (minimum_allele_frequency) that we can be sure is greater than zero. You may be interested in using such a script to preprocess your GRCh38 resource file. Alternatively, I have prepared a use-at-your-own-risk GRCh37-->GRCh38 gnomAD lifted-over resource. This and the GRCh37 version that our Mutect2 developer prepared are both available from the GATK Resource Bundle, on our FTP site and under the Mutect2 folder.

    I've represented your request as a discussion item in the gatk repo at https://github.com/broadinstitute/gatk/issues/5391. You can follow along and add your own two cents to the issue ticket with a GitHub account (free).

  • davidbendavidben BostonMember, Broadie, Dev ✭✭

    @registered_user and @EADG I'll chime in with some comments and also clarify some things about Mutect2.

    @shlee is right about everything: 1) The AF=. is probably the source of the error and some sites with that don't cause problems because M2 doesn't end up parsing the germline resource vcf for those sites. 2) The provided files in the resource bundle do not have this issue and are optimized to run faster. Going further [this is my own editorializing], it's fair to say that you need an extremely good reason, like non-human samples, not to use the provided resources.

    Now for some unsolicited comments just because I'm fond of M2 and want to see it run optimally!

    I have spent a lot of time setting up parallelization for version 4.

    I'm curious about your particular needs, because M2 is automatically parallelized just by running the provided wdl with Cromwell, or by running the featured workflow on Firecloud. Now, if you're talking about thread-level parallelization, that's another story, but if you have multiple cores and want to reduce wall-clock time the wdl is intended to make this easy.

    Yes there a some minor change in the version like the -T to choose the tool. . . Another advise is to stick to version 3.7 it is the most stable version of the latest.

    We strongly recommend against using GATK3 M2. The superficial changes are minor, but internally the algorithm has been completely overhauled. According to our validations (the DREAM challenge samples, about 30 or so synthetic tumor in-vitro mixtures, ~1000 TCGA WES samples with matched WGS, and ~100 normal-normal call sets over several different sequencing protocols) you can expect GATK3 M2 to have about twice as many false positives, a few percent less sensitivity, and to be about 7x slower.

    Also, if you don't like GATK3 calls, nothing can be done because it's a deprecated tool. If you don't like your GATK4 M2 calls you can complain on the forum. While we can't promise to make you happy with every call (or non-call), we can promise that we like any sort of feedback!

  • registered_userregistered_user Member
    edited November 5

    @shlee and @davidben thanks again for digging into this issue. I don't think the resource bundle contained hg38 versions of the gnomAD data, and maybe you should not rely on every user using these files when they become outdated or for whatever else reason.

    I'm not familiar with the Cromwell + WDL-system, and I could not find any instructions/ tutorial on parallelizing the processing so I just went with my own system and splitting my runs for each chromosome. Looking at the mutect2_wdl github page (which I didn't know existed) the system does not seem very easy to get into. Also, I'm wondering what advantages or limitations it might have if I manage to get it running. Does it use the spark versions of the tools? The notes saying "beta" , "use at your own risk", "not for production use" are a bit off-putting when considering starting using them.

    One disadvantage of running multiple docker containers, I have discovered, is that if I run too many in parallel and run out of memory some of the runs just stop. No error message or anything, they just stop.

    I was able to successfully process my samples by filtering out all of the rows without "PASS" in the filter col in the gnomAD resource. Now I don't need to back to GATK3 thanks to your advice. :)

  • EADGEADG KielMember ✭✭✭

    @registered_user, @davidben, and @shlee thank for the advice and information, I am always willing to learn new things. I know that the GATK 4 has a completely new base. In the future, I will recommend going back to an older Mutect Version only for troubleshooting.

    @davidben, and @shlee

    We strongly recommend against using GATK3 M2. The superficial changes are minor, but internally the algorithm has been completely overhauled. According to our validations (the DREAM challenge samples, about 30 or so synthetic tumor in-vitro mixtures, ~1000 TCGA WES samples with matched WGS, and ~100 normal-normal call sets over several different sequencing protocols) you can expect GATK3 M2 to have about twice as many false positives, a few percent less sensitivity, and to be about 7x slower

    Wow, that cool could we get some statistics for that? Maybe as an addition to article Blog#10911 ?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭

    We're currently working on a Mutect2 paper that should be on bioarxiv in a couple of months.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭

    A bug fix for the AF=. issue has been submitted: https://github.com/broadinstitute/gatk/pull/5442

Sign In or Register to comment.