
Question about Mutect2 Runtime on Whole Genome Sequencing Data

jejacobs23jejacobs23 Portland, ORMember

Hello,
I am running GATK 4.0.12.0 and have a question regarding Mutect2. My tumor .bam file is 232 GB and the matched normal .bam file is 136 GB. I ran the following on our institution's compute cluster after requesting 8 GB per CPU in the resource allocation.

srun $GATK/gatk --java-options "-Xmx2g" Mutect2 \
-R <path_to_ref> \
-I <path_to_tumor.bam> \
-I <path_to_normal.bam> \
-tumor $tNAME \
-normal $nNAME \
--germline-resource <path_to_germline_resource> \
--af-of-alleles-not-in-resource 0.00003125 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-O <path_to_outfile.vcf> \
-bamout <path_to_outfile.bam>

This ran for 10 days and then timed out. Interestingly, the .vcf file is dated only 2 days after the job started and thus has not changed since that time. Also, it appears that the vcf.gz.tbi file was successfully created. However, the .bai file that goes with the "bamout" file was not successfully created.

My questions are as follows:
1) Could the "-bamout" option be somehow holding up the completion of this job?
2) Are there ways to optimize the performance of Mutect2 with regard to resource allocation (either in the "--java-options" or in my own resource request to the cluster)?
3) More broadly, what processes are using resources in Mutect2? Does it require that large sets of data be held in memory while being processed?


Answers

  • jejacobs23 (Portland, OR, Member)

    Thank you. This was very helpful. It is working better now, but it has run into Java heap-space issues, which I will post about under the appropriate discussion topic. Thanks again.

  • davidben (Boston, Member, Broadie, Dev)

    @jejacobs23 A few comments:

    --java-options "-Xmx2g": This is not enough memory for Mutect2. It could easily cause slow performance, although usually I would expect it to just crash.

    -tumor $tNAME: This argument is no longer needed.

    --germline-resource <path_to_germline_resource>: Just to be sure, is this the resource in the GATK best practices bucket, gs://gatk-best-practices/?

    --af-of-alleles-not-in-resource 0.00003125: This argument is no longer needed.

    --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter: This argument is no longer needed. (A trimmed-down example command is sketched after this post.)

    1) Could the "-bamout" option be somehow holding up the completion of this job? We have never seen this happen before, so it is unlikely.

    3) More broadly, what processes are using resources in Mutect2? Does it require that large sets of data be held in memory while being processed? The main demand on RAM is simply loading a bunch of reads at once.
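
    For reference, a trimmed-down invocation along the lines of these comments might look something like the sketch below. Paths and sample names are the same placeholders as in the original command, and the larger heap is only an illustrative bump over the original 2g (a concrete recommendation comes later in this thread):

    # Sketch only: original inputs, deprecated arguments dropped, larger Java heap.
    srun $GATK/gatk --java-options "-Xmx6g" Mutect2 \
        -R <path_to_ref> \
        -I <path_to_tumor.bam> \
        -I <path_to_normal.bam> \
        -normal $nNAME \
        --germline-resource <path_to_germline_resource> \
        -O <path_to_outfile.vcf> \
        -bamout <path_to_outfile.bam>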

  • jejacobs23 (Portland, OR, Member)

    Thanks @davidben. This is really helpful.

    For the "-Xmx" option, is there an amount you recommend? My tumor .bam file is 232 Gb and the matched normal .bam file is 136 Gb

    I am using the hg38 alternate contigs in my variant analysis. Does the updated Mutect2 now automatically account for this? Is that why "--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter" is no longer needed? As a side note, I created my PONs using the "--disable-read-filter" option. Does this mean I'll have to redo all of them?

    The germline-resource I'm using is "af-only-gnomad.hg38.vcf.gz" downloaded from ftp://[email protected]/bundle/Mutect2/af-only-gnomad.hg38.vcf.gz

    Interestingly, I attempted to split the Mutect2 run up over various intervals, and all but one of them finished in a reasonable amount of time. A single interval ran for 4 days without any progress in the actual output files. The interval in question included chrX, chrY, and chrM, as well as all the "random" contigs, all the "chrUn" contigs, and almost all of the "_alt" contigs. Could the problem be with those contigs?
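
    For reference, a rough sketch of this kind of per-interval scatter is shown below. The interval groupings, heap size, and output names are illustrative rather than the exact ones used in this run, and the per-interval VCFs would still need to be combined afterwards (for example with MergeVcfs) before filtering:

    # Sketch only: one Mutect2 job per contig; groupings and file names are illustrative.
    for L in chr1 chr2 chr3; do
        srun $GATK/gatk --java-options "-Xmx6g" Mutect2 \
            -R <path_to_ref> \
            -I <path_to_tumor.bam> \
            -I <path_to_normal.bam> \
            -normal $nNAME \
            --germline-resource <path_to_germline_resource> \
            -L $L \
            -O mutect2_${L}.vcf.gz &
    done
    wait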

  • davidben (Boston, Member, Broadie, Dev)

    Mutect2 usually stays under 4 GB for a 100x tumor and 100x normal, but it's best to give it 6 GB to be safe. It depends on the depth, not the total size of the bam files.

    --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter is no longer needed because it has become the default, so your PONs are fine.

    The problem is most likely with the alt contigs. We don't have much experience with them. Even on our hg38-aligned data we usually exclude them from variant calling, as they can be hard to interpret (one way to restrict calling to the primary contigs is sketched after this post).

    That being said, we always want to understand use cases and adapt the tool if there is sufficient demand. Could you explain what you are doing with the alt contigs and why you need them?
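
    As referenced above, one way to restrict calling to the primary assembly is to pass Mutect2 an interval list covering only chr1-chr22, chrX, chrY, and chrM. A minimal sketch, using a hypothetical file name:

    # Sketch only: build a primary-contig interval list (one contig name per line).
    for c in $(seq 1 22) X Y M; do echo "chr${c}"; done > primary_contigs.list

    The list can then be passed with -L primary_contigs.list; equivalently, a list of alt contigs could be excluded with -XL (--exclude-intervals).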

  • jejacobs23 (Portland, OR, Member)

    Thanks @davidben

    I am analyzing several tumor genomes and their matched normal controls. I am attempting to follow the tutorial at https://gatkforums.broadinstitute.org/gatk/discussion/8017/how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38. It was my understanding that the whole point of using alt-aware alignment was that you then had the option to call variants on the alternate contigs. Thus, I am using the alternate contigs in my attempt to identify variants in the tumor relative to control. Please let me know if I have misunderstood this concept.

    Of note, one of my interval files, which also contained a small number of alternate contigs, finished without error.

    I'll try the Mutect2 run again with "-Xmx6g".

  • davidben (Boston, Member, Broadie, Dev)

    @jejacobs23 Using hg38 has many advantages even if you only call on the primary contigs. Furthermore, calling on alt contigs is quite subtle and requires a non-standard pipeline, something we at the Broad don't do in production. I don't think you have misunderstood the concept; rather, I think you might have overestimated the extent to which we have a solution.

    Now, this is totally a cop-out answer. I regret that I can't do better. However, I can suggest some small computational experiments. First, because there is not yet an official hg38 gnomAD or ExAC, our hg38 resources are lifted over from their hg19 versions. This is probably okay for the primary contigs but must be perilous for at least some alt contigs. Could you try running Mutect2 without the --germline-resource argument? The results will be bad (full of germline variants and possibly contamination artifacts), but I'm curious about the effect on runtime. Another, more tedious, thing to try would be to run Mutect2 on different subsets of alt contigs and essentially find which ones cause problems by binary search (a rough sketch of such an experiment follows at the end of this post).

    You don't have to do any of this, of course. Just running on chrs 1-22, X, Y, M is perfectly serviceable for general-purpose somatic variant calling. But if you do, we will be interested.
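
    For anyone trying the second experiment, a rough sketch of one round of the subset halving is below. The file names are hypothetical, the alt-contig list is assumed to have one contig name per line, and --germline-resource is deliberately left off per the first experiment (so the output is only useful for diagnosing runtime, not for real calls):

    # Sketch only: split a hypothetical list of alt contigs in half and run each half;
    # repeat the halving on whichever half stalls.
    split -n l/2 --additional-suffix=.list alt_contigs.list alt_subset_
    for subset in alt_subset_*.list; do
        srun $GATK/gatk --java-options "-Xmx6g" Mutect2 \
            -R <path_to_ref> \
            -I <path_to_tumor.bam> \
            -I <path_to_normal.bam> \
            -normal $nNAME \
            -L $subset \
            -O "${subset%.list}.vcf.gz"   # no germline resource, per the first experiment
    done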

