Mutect2 speed varies greatly across regions

Hi,

I am trying to run Mutect2 on WES data on a cluster where I need to use 24 cores simultaneously. To parallelize this, I used SplitIntervals to create 24 equal intervals and ran a single instance of Mutect2 per interval, each on a single core.
However, I am finding that the speed of Mutect2 varies enormously. Between samples, the difference can be up to 20-fold (250 reads/minute for some samples versus 5000+ for others). All samples use the same reference genome and are of comparable size (about 30 GB each for the normal and tumor samples). Is it normal that some samples take this much longer?
This, however, is not my main problem. The biggest problem is that the speed on each region within a patient can also vary a lot, sometimes by a factor of 5. This often leads to a single region taking far longer than all the others, so I have 24 cores reserved while only 1 is doing work. Effectively, I am using less than 50% of the cores on average. Since I have limited computing time on the cluster, is there any way to make this more efficient? Are regions with a higher gene density always slower (I would expect them to be, but this much slower)? I did notice that the interval containing chromosome 19, which has the highest gene density, is always slower, as expected. I expect that I am not the first person to run into this problem. Does anyone have a solution, or should I simply split the genome into 24 parts with an equal number of genes? I searched for solutions, but this seems to be the only option I could find.
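
The question above about splitting into parts of equal weight is a load-balancing problem. A minimal sketch, assuming each interval can be given a rough cost estimate (for example its number of targeted bases or genes; which proxy best predicts Mutect2 runtime is not established in this thread): sort the intervals by weight, largest first, and always hand the next one to the currently lightest bin. The helper name `balance_bins` and the two-column, tab-separated input format are hypothetical.

```shell
# Sketch: greedy "largest first" assignment of weighted intervals to bins.
# balance_bins is a hypothetical helper, not part of GATK.
#
# Usage: balance_bins <n_bins> < weights.tsv
# Input lines:  <interval><TAB><weight>
# Output lines: <bin><TAB><interval><TAB><weight>   (bins numbered from 0)
balance_bins() {
    local n_bins=$1
    # Heaviest intervals first, then place each into the lightest bin so far.
    sort -t "$(printf '\t')" -k2,2nr \
        | awk -F '\t' -v n="$n_bins" '
            {
                best = 0
                for (b = 1; b < n; b++)
                    if (load[b] + 0 < load[best] + 0) best = b
                load[best] += $2
                printf "%d\t%s\t%s\n", best, $1, $2
            }'
}
```

For 24 cores, `balance_bins 24 < weights.tsv` prints a bin number (0 to 23) in front of each interval; the intervals of each bin can then be written to their own BED file and passed to one Mutect2 instance with -L. This greedy heuristic evens out the estimated cost only, so actual runtimes can still differ.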

Answers

  • bshifaw Member, Broadie, Moderator, admin

    Hi @TomvdBosch

    This may already have been explained in another forum thread. Would you mind reading through the comments and answers in the following thread to see if they answer your question?
    Mutect2 parallel problem

    What version of GATK are you using? There have been some improvements in the latest version of GATK4 (noted in this blog post) that may help.

    What is your Mutect2 command?

  • TomvdBosch Member
    Thanks for the response @bshifaw

    The thread that you linked has some very useful responses. In particular, I was not aware that for exome sequencing it is recommended to use the -L argument with the exon regions for speed; I will try this. However, the responses and links in the thread do not solve my problem of parallelizing across the genome such that each Mutect2 instance runs for about as long (some difference is fine, but not a 100% difference between the slowest and the second slowest, as I sometimes see at the moment).
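
    A complementary way to approach the uneven runtimes described above, sketched here as an assumption and not as something recommended in this thread: cut the intervals into several times more shards than there are cores, and let a worker pool start the next shard whenever a core becomes free, so that no single slow shard holds up the rest for long. `run_shards` is a hypothetical helper built on `find` and `xargs -P`; the shard directory and the wrapper script named in the comments are placeholders.

    ```shell
    # Sketch: run many small shards on a fixed number of cores.
    # run_shards is a hypothetical helper, not part of GATK.
    #
    # Usage: run_shards <shard_dir> <n_parallel> <command> [args...]
    # Runs "<command> [args...] <shard_file>" once per file in <shard_dir>,
    # keeping at most <n_parallel> jobs running at any time.
    run_shards() {
        local shard_dir=$1 n_parallel=$2
        shift 2
        find "$shard_dir" -type f -print0 \
            | xargs -0 -n 1 -P "$n_parallel" "$@"
    }

    # Example (placeholder names): after creating e.g. 96 shards with
    #   ./gatk SplitIntervals -R ucsc.hg19.fasta -L interval.bed \
    #       --scatter-count 96 -O shards/
    # run a hypothetical wrapper that calls Mutect2 with -L "$1" on 24 cores:
    #   run_shards shards/ 24 ./mutect2_one_shard.sh
    ```

    With four shards per core, one slow shard delays the job by at most roughly its own runtime, at the price of some extra per-process startup overhead. The per-shard VCFs then need to be merged afterwards.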

    I am using GATK 4.1.2.0, though the same occurs in 4.0.11.0. I have never used GATK3.
    My Mutect2 command is pretty standard:
    ./gatk Mutect2 \
    -R ucsc.hg19.fasta \
    -I tumor.bam \
    -tumor <tumor_sample_name> \
    -I normal.bam \
    -normal <normal_sample_name> \
    -O output.vcf \
    -L interval.bed

    This is what I run on my own computer. On the cluster, I add --java-options "-Xmx2400M", as otherwise my job is killed for using too much memory. However, I did not see any significant change in running time with this limit added.
  • TomvdBoschTomvdBosch Member
    Thanks, I think I have enough information to solve my problem now.