
Excessive memory usage with MuTect2

zgompert (Utah, USA) Member

I am trying to use MuTect2 for somatic variant discovery. I am running GATK v4.1.4.0 with Java HotSpot(TM) 64-Bit Server VM v1.8.0_112-b15.

When running either a single sample (tumor-only mode, to generate a panel of normals) or a tumor sample paired with a normal, memory usage is very high (over 400 GB of RAM). This is not the case initially; memory usage gradually climbs during the run. The data are not whole-genome sequence data but RADseq/GBS data, which means much of the genome is not covered by reads; where there are reads, they start and stop in similar places and cover a ~85 bp region at moderate coverage (around 10X on average). Here is an example of the command I am running (note that I have made some modifications to the standard command to add more memory and obtain additional information for debugging):

java -Xmx384g -XX:-UseGCOverheadLimit -jar ~/bin/gatk-package- Mutect2 -R /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa -I aln_mem_mod_003-S.uniqe.bam -I aln_mem_mod_013-S.uniqe.bam -normal potr-mod_013-S --independent-mates --max-mnp-distance 0 -debug --dont-increase-kmer-sizes-for-cycles -O somatic.vcf.gz

The run generates a vcf file that doesn't have any obvious errors for the regions of the genome it gets to, but fails to finish before running out of memory.

I have tried the identical command on a different data set with whole genome sequences and do not see the same memory issue. Thus, I think the problem with memory usage stems from the RADseq/GBS data. With that said, I don't know what about RADseq/GBS data would cause such a problem. Additionally, the reference genome I am using for aligning the RADseq/GBS data is highly fragmented (most contigs ~10 kb). Are there any modifications I might be able to make to the command I am running that could solve this problem?



  • bhanuGandham (Cambridge, MA) Member, Administrator, Broadie, Moderator admin

    Hi @zgompert,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal/erroneous results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply; however, we encourage other community members to help out if they know the answer.

    For context, see this [announcement](https://software.broadinstitute.org/gatk/blog?id=24419 "announcement") and check out our [support policy](https://gatkforums.broadinstitute.org/gatk/discussion/24417/what-types-of-questions-will-the-gatk-frontline-team-answer/p1?new=1 "support policy").

  • davidben (Boston) Member, Broadie, Dev ✭✭✭

    @zgompert This is a very interesting question. M2's RAM consumption for this data should be around 2-3 GB. There must be a memory leak, usually minor or non-existent, that accumulates over a large number of small contigs. My wild guess is that the leak is in the downsampler hanging on to a few reads per contig. You could test that by turning off downsampling with --max-reads-per-alignment-start 0 in your command.
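    As a concrete sketch of that test (not an official recommendation), this is the command from the original post with downsampling disabled; the jar path, file names, and sample names are the poster's:

    ```shell
    # Re-run Mutect2 with the downsampler disabled
    # (--max-reads-per-alignment-start 0 keeps all reads at each alignment
    # start instead of downsampling) to test the downsampler-leak hypothesis.
    java -Xmx384g -XX:-UseGCOverheadLimit -jar ~/bin/gatk-package- Mutect2 \
        -R /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa \
        -I aln_mem_mod_003-S.uniqe.bam \
        -I aln_mem_mod_013-S.uniqe.bam \
        -normal potr-mod_013-S \
        --independent-mates \
        --max-mnp-distance 0 \
        -debug \
        --dont-increase-kmer-sizes-for-cycles \
        --max-reads-per-alignment-start 0 \
        -O somatic.vcf.gz
    ```

    If memory stays flat with downsampling off, that would point at the downsampler; if it still climbs, the leak is elsewhere.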

    Could you tell me more about this reference? Why does it need to be fragmented (i.e., why a fragmented reference rather than the regular reference with an interval list of restriction fragments)? How many contigs does it have?
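    If a fragmented reference is not strictly required, one alternative worth sketching (under the assumption that the covered RAD loci can be derived from the alignments; the reference and output file names here are hypothetical) is to align against the regular reference and restrict Mutect2 to an interval list of those loci:

    ```shell
    # Derive merged coverage intervals from a coordinate-sorted BAM:
    # bedtools bamtobed emits one interval per read, and bedtools merge
    # collapses overlapping/adjacent intervals into discrete RAD loci.
    bedtools bamtobed -i aln_mem_mod_003-S.uniqe.bam | bedtools merge > rad_loci.bed

    # Run Mutect2 on the regular (unfragmented) reference, restricted to
    # the covered loci with -L; regular-reference.fa is a placeholder name.
    gatk Mutect2 \
        -R regular-reference.fa \
        -I aln_mem_mod_003-S.uniqe.bam \
        -L rad_loci.bed \
        -O somatic.vcf.gz
    ```

    Besides confining work to regions that actually have reads, this might sidestep the suspected per-contig accumulation entirely, since the walker no longer iterates over thousands of tiny contigs.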
