Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

MuTect2 seems to be hung, has created .vcf but run for many hours without creating the .idx

tedtoaltedtoal Member
edited September 2016 in Ask the GATK team

I've had several instances of MuTect2 running, on 5 samples. Three have finished and there is a .vcf and corresponding .idx file for each, each pair with the same time stamp. The other 2 samples have not finished although they have been running for 24 hours in one case and 15 hours in the other. The .vcf files for those two are present in the directory and their size is non-zero. I think files are usually size 0 while they are being created, which suggests to me that Mutect2 is finished making the .vcf files. There are no .idx files in the directory for those two. I assume that creating the index file is very fast. I'm wondering if somehow MuTect2 has hung? 'top' shows two processes running with Mutect2, and 'top -H' shows perhaps 26 additional processes that come and go, and are threads running within Mutect2, I think. I have not specified the -nct option so as far as I know Mutect2 should not be running extra threads (although maybe that just means it won't run a certain TYPE of thread?). I'm unsure whether to keep waiting for it to finish, or kill it. The file size has not changed all day.

Best Answer

Answers

  • tedtoaltedtoal Member

    Update: another sample finished and updated the file date and wrote a .idx file when it finished. One more sample is still running. It is very odd that each sample is taking roughly twice (or so) as long as the previous sample. Each is being run on a separate core. The bam file sizes vary, but the execution time is not at all correlated with total bam file size. The first four samples required this amount of time to run through Mutect2:

    1. 4 hours
    2. 13 hours
    3. 28 hours
    4. 57 hours

    The fifth is not done yet, and has run 98 hours so far (and output is not visible yet due to use of make -O flag).

    The first part of the log output from Mutect2 on one of the samples reads as follows (note that it says "Using AVX accelerated implementation of PairHMM"):

    INFO 16:50:36,496 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:50:36,499 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
    INFO 16:50:36,500 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 16:50:36,500 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 16:50:36,506 HelpFormatter - Program Args: --analysis_type MuTect2 --reference_sequence /share/carvajal-archive/REFERENCE_DATA/GATK_Bundle/b37/human_g1k_v37_decoy.fasta --intervals TargetRgns_PaddedMerged.bed -I:normal MAP/XJ831-CG-12N_recal.bam -I:tumor MAP/XJ831-CG-12T_recal.bam --out SOM_VARS/XJ831-CG-12.mutect.vcf --dbsnp /share/carvajal-archive/REFERENCE_DATA/GATK_Bundle/b37/dbsnp_138.b37.vcf --cosmic /share/carvajal-archive/REFERENCE_DATA/COSMIC/v74/VCF/CosmicCodingMuts.vcf.gz --annotateNDA --dontUseSoftClippedBases
    INFO 16:50:36,510 HelpFormatter - Executing as [email protected] on Linux 3.13.0-95-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_79-b15.
    INFO 16:50:36,511 HelpFormatter - Date/Time: 2016/09/21 16:50:36
    INFO 16:50:36,512 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:50:36,513 HelpFormatter - --------------------------------------------------------------------------------
    INFO 16:50:37,424 GenomeAnalysisEngine - Strictness is SILENT
    INFO 16:50:37,622 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 16:50:37,633 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 16:50:37,751 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.12
    INFO 16:50:44,819 IntervalUtils - Processing 100696230 bp from intervals
    WARN 16:50:44,845 IndexDictionaryUtils - Track cosmic doesn't have a sequence dictionary built in, skipping dictionary validation
    INFO 16:50:44,972 GenomeAnalysisEngine - Preparing for traversal over 2 BAM files
    INFO 16:51:02,147 GenomeAnalysisEngine - Done preparing for traversal
    INFO 16:51:02,149 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 16:51:02,151 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 16:51:02,152 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
    INFO 16:51:02,390 MuTect2 - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
    Using AVX accelerated implementation of PairHMM
    INFO 16:51:20,453 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
    INFO 16:51:20,454 VectorLoglessPairHMM - Using vectorized implementation of PairHMM

  • tedtoaltedtoal Member

    Update: my fifth and final sample is still running under MuTect2, total time now 133.5 hours. It's .bam file sizes are considerably less than the sizes of the sample that ran in 4.4 hours. I'm guessing something is wrong with it. I do see that the number of active regions processed does increase as MuTect2 execution time increases, but not linearly. The 4.4hr sample had 3692 active regions, 13.4hr had 13492, 28hr had 16984, and 58hr had 23596. I don't know how an active region is determined, but all samples used the same parameter settings.

  • tedtoaltedtoal Member

    Update 3: I started another instance running the same Mutect2 job so I could see what the Mutect2 log says, and it indicates that the time required will be a little over 5 days, in which case it should be finishing up soon. I also ran under GATK 3.6 and got a similar estimate. So, it looks like maybe everything is running ok. I'm surprised though. What would cause this sample to take 25X longer with Mutect2 than a sample that has LARGER .bam files?

  • SheilaSheila Broad InstituteMember, Broadie admin

    @tedtoal
    Hi,

    It looks like the longer the runtime, the more active regions there are. This is indeed expected. MuTect2 does a reassembly step for each active region which takes a lot of time. You can read more about it here (the same reassembly happens in MuTect2 as well as HaplotypeCaller).

    As for how active regions are determined, you can have a look here. In MuTect2, we use both the tumor and normal samples to determine active regions. It is the same as HaplotypeCaller, but if the normal sample shows evidence for the variant, it will not be considered an active region.

    What are the exact commands you are running? I suspect the BAM files that are taking very long have a lot of messy regions which are triggering active regions.

    -Sheila

  • tedtoaltedtoal Member

    Here is the command line. It is still running, and the runtime is now up to 5d. 14h.

    /share/PACKAGES/src/java/jdk1.7.0_79/bin/java -Xmx16g -Djava.io.tmpdir=/share/tmp -jar /share/PACKAGES/src/GATK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar --analysis_type MuTect2 --reference_sequence /share/REFERENCE_DATA/GATK_Bundle/b37/human_g1k_v37_decoy.fasta --intervals TargetRgns_PaddedMerged.bed -I:normal MAP/GM022584-CP-H1103180N_recal.bam -I:tumor MAP/GM022584-CP-H1103180T_recal.bam --out SOM_VARS/GM022584-CP-H1103180.mutect.vcf --dbsnp /share/REFERENCE_DATA/GATK_Bundle/b37/dbsnp_138.b37.vcf --cosmic /share/REFERENCE_DATA/COSMIC/v74/VCF/CosmicCodingMuts.vcf.gz --annotateNDA --dontUseSoftClippedBases

  • SheilaSheila Broad InstituteMember, Broadie admin

    @tedtoal
    Hi,

    Did the runs finish? We have a team dedicated to improving MuTect2 now, so improving the runtime is definitely a high priority.

    -Sheila

  • tedtoaltedtoal Member

    Yes, it did finally finish. The longest run lasted for 6 days, 6-1/2 hours. The initial time that it estimated was a little short of 5 days. There was a huge range of times required, as shown in my first comment. And when it finished, it created the .idx file. I've changed my pipeline to log mutect2 output as it occurs so I can check its time estimate and other log info while it is running (my other logging of 'make' output only logged when a job finished, when using -O option on make).

Sign In or Register to comment.