Question about Mutect2 Runtime on Whole Genome Sequencing Data
I am running GATK 184.108.40.206 and have a question regarding Mutect2. My tumor .bam file is 232 GB and the matched normal .bam file is 136 GB. I ran the following on our institution's cluster computing system after requesting 8G per CPU in the resource allocation.
srun $GATK/gatk --java-options "-Xmx2g" Mutect2 \
    -R <path_to_ref> \
    -I <path_to_tumor.bam> \
    -I <path_to_normal.bam> \
    -tumor $tNAME \
    -normal $nNAME \
    --germline-resource <path_to_germline_resource> \
    --af-of-alleles-not-in-resource 0.00003125 \
    --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
    -O <path_to_outfile.vcf> \
    -bamout <path_to_outfile.bam>
This ran for 10 days and then timed out. Interestingly, the .vcf file was last modified only 2 days after the job started, so it did not change at all for the remaining 8 days. It also appears that the vcf.gz.tbi file was successfully created. However, the .bai file that goes with the "bamout" file was not.
My questions are as follows:
1) Could the "-bamout" option be somehow holding up the completion of this job?
2) Are there ways to optimize the performance of Mutect2 with regard to resource allocation (either in the "--java-options" or in my own request for resources to the cluster)?
3) More broadly, what processes are using resources in Mutect2? Does it require that large sets of data be held in memory while being processed?