Question about Mutect2 Runtime on Whole Genome Sequencing Data
I am running GATK 22.214.171.124 and have a question regarding Mutect2. My tumor .bam file is 232 Gb and the matched normal .bam file is 136 Gb. I ran the following command on our institution's cluster computing system after requesting 8G per CPU in the resource allocation.
srun $GATK/gatk --java-options "-Xmx2g" Mutect2 \
    -R <path_to_ref> \
    -I <path_to_tumor.bam> \
    -I <path_to_normal.bam> \
    -tumor $tNAME \
    -normal $nNAME \
    --germline-resource <path_to_germline_resource> \
    --af-of-alleles-not-in-resource 0.00003125 \
    --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
    -O <path_to_outfile.vcf> \
    -bamout <path_to_outfile.bam>
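Note that the command above caps the Java heap at 2 GB ("-Xmx2g") although 8G per CPU was requested from the scheduler. The following is a minimal sketch of how the heap flag could be derived from the allocation instead. It assumes a Slurm cluster where the memory was requested with --mem-per-cpu, so that SLURM_MEM_PER_CPU (in MB) is set inside the job. The helper name heap_opt and the 75% heap fraction are hypothetical choices for illustration, not GATK recommendations, and this is not a confirmed fix for the runtime problem.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert a per-CPU memory allocation (in MB) into a
# JVM heap flag. The 3/4 fraction is an assumption that leaves headroom
# for non-heap JVM memory; adjust it to your cluster's behaviour.
heap_opt() {
    local alloc_mb="$1"
    echo "-Xmx$(( alloc_mb * 3 / 4 ))m"
}

# SLURM_MEM_PER_CPU is set by Slurm (in MB) when --mem-per-cpu is used.
# Fall back to 8192 MB (the 8G mentioned above) when it is unset.
JAVA_HEAP="$(heap_opt "${SLURM_MEM_PER_CPU:-8192}")"

# Print the option string that would replace --java-options "-Xmx2g".
echo "--java-options \"$JAVA_HEAP\""
```

The printed string would be substituted for the --java-options argument in the Mutect2 command; whether a larger heap actually shortens the run is part of what is being asked below.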
This ran for 10 days and then timed out. Interestingly, the .vcf file was last modified only 2 days after the job started and has not changed since. It also appears that the vcf.gz.tbi file was successfully created. However, the .bai index that goes with the "bamout" file was not created.
My questions are as follows:
1) Could the "-bamout" option be somehow holding up the completion of this job?
2) Are there ways to optimize the performance of Mutect2 with regard to resource allocation (either in the "--java-options" or in my own request for resources from the cluster)?
3) More broadly, which steps in Mutect2 use the most resources? Does it require large sets of data to be held in memory while they are being processed?