Why is HaplotypeCaller spending 8 hrs on "Strictness is SILENT" step?


I am running HaplotypeCaller on whole genome re-sequenced (~10X coverage) African buffalo genomes, using a high coverage African buffalo genome as the reference (~90X). The genome is about 2.8Gb. The reference genome currently consists of 442 402 scaffolds and contigs.

HaplotypeCaller works fine and produces the expected output etc., but it is spending a lot of time on the "Strictness is SILENT" step in particular, but also on the "MicroScheduler" and "Preparing for traversal over 1 BAM files" steps (9 and 7 hrs, respectively) and is taking >36 hours per genome.

I know this is not a quick analysis and some steps will take a long time, but all the log files I've seen on the forum and from a colleague (working on smaller fungal genomes) have Strictness is SILENT steps of <1 min.

Why is HaplotypeCaller spending so much time on this step? Could it be because of the many scaffolds and contigs in the reference genome? Is there something I can do to speed up this step (and/or the other two steps with long processing times)?

I am running GATK v3.6-0-g89b7209 and Java 1.8.0_73-b02. I've allocated Java 10GB of memory (-Xmx10g), but have 125GB available. Would increasing the allocated RAM (to say about 40GB) help speed up some of these steps?

The log file:

<br />INFO  21:29:46,295 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  21:29:46,298 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29 
INFO  21:29:46,298 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  21:29:46,298 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk 
INFO  21:29:46,298 HelpFormatter - [Mon Sep 25 21:29:46 SAST 2017] Executing on Linux 3.10.0-514.6.1.el7.x86_64 amd64 
INFO  21:29:46,298 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 JdkDeflater 
INFO  21:29:46,301 HelpFormatter - Program Args: -T HaplotypeCaller -R /mnt/lustre/users/djager/buf_clean/alignment_files/bam_sorted/gatk/GATK_Deon/refs/buffalo.final.fa -I /mnt/lustre/users/djager/buf_clean/alignment_files/bam_sorted/gatk/GATK_Deon/M_47_14_aln-PE_sorted_dups_marked.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -hets 0.03 -mbq 20 -stand_emit_conf 20 -stand_call_conf 30 -out_mode EMIT_ALL_CONFIDENT_SITES -nct 24 -ploidy 2 -o M_47_14_aln-PE_sorted_dups_marked_output.raw.snps.indels.g.vcf 
INFO  21:29:46,307 HelpFormatter - Executing as [email protected] on Linux 3.10.0-514.6.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02. 
INFO  21:29:46,307 HelpFormatter - Date/Time: 2017/09/25 21:29:46 
INFO  21:29:46,308 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  21:29:46,308 HelpFormatter - ---------------------------------------------------------------------------------- 
WARN  21:29:46,314 GATKVCFUtils - Naming your output file using the .g.vcf extension will automatically set the appropriate values  for --variant_index_type and --variant_index_parameter 
INFO  21:29:46,332 GenomeAnalysisEngine - Strictness is SILENT 
INFO  05:46:59,587 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 
INFO  05:46:59,737 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  05:47:07,772 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 8.03 
INFO  05:47:11,853 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 
INFO  05:47:13,107 MicroScheduler - Running the GATK in parallel mode with 24 total threads, 24 CPU thread(s) for each of 1 data thread(s), of 24 processors available on this machine 
INFO  14:05:27,691 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  21:57:03,635 GenomeAnalysisEngine - Done preparing for traversal 

Thanks and kind regards

Best Answers


  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Have you tried with removing -nct option first. Seems like the nanoscheduler is dead.

    Reminder. Use QUEUE scatter gather not -nct. -nct is not your friend. ;)

  • Thanks for the response SkyWarrior. I removed the -nct option and it still spent 7 hours on the Strictness is SILENT step.

    I then tried to use Queue and went through various versions of the script, adjusting things according to the error messages that came up. This is where I'm currently at and I'm not sure what to do from here. Could perhaps take a look, please?


    package org.broadinstitute.gatk.queue.qscripts
    import org.broadinstitute.gatk.queue.QScript
    import org.broadinstitute.gatk.queue.extensions.gatk._
    import org.broadinstitute.gatk.tools.walkers.haplotypecaller.ReferenceConfidenceMode
    class VariantCaller extends QScript {
        // Required arguments. All initialized to empty values.
        @Input(doc="The reference file for the bam files.", shortName="R", required=true)
        var referenceFile: File = _
        @Input(doc="One or more bam files.", shortName="I")
        var bamFile: Seq[File] = _
        @Input(doc="Output core filename.", shortName="O", required=true)
        var outputFilename: File = _
        @Argument(doc="Maxmem.", shortName="mem", required=true)
        var maxMem: Int = 40
        @Argument(doc="Number of cpu threads per data thread", shortName="nct", required=true)
        var numCPUThreads: Int = 24
        @Argument(doc="Number of scatters", shortName="nsc", required=true)
        var numScatters: Int = 24
        @Argument(doc="Minimum phred-scaled confidence to call variants", shortName="stand_call_conf", required=true)
        var standCallConf: Int = 30 //30 //default: best-practices value
        @Argument(doc="Minimum base quality required to consider a base for calling", shortName="min_base_quality_score", required=true)
        var mbq: Int = 20 //10 //default: best-practices value
        // The following arguments are all optional.
        @Argument(doc="Ploidy (number of chromosomes) per sample", shortName="ploidy", required=false)
    var samplePloidy: Int = 2
        @Argument(doc="", shortName="gvcf", required=false)
        var gvcf: Boolean = true
        def script() {
            // Define common settings for GVCF HC.
            trait HC_Arguments extends HaplotypeCaller {
                this.reference_sequence = referenceFile
                this.scatterCount = numScatters
                this.memoryLimit = maxMem
                this.num_cpu_threads_per_data_thread = numCPUThreads
                this.min_base_quality_score = mbq
                this.stand_call_conf = standCallConf
            // GVCF HC
            for (bamFile <- bamFile) {
            val haplotypeCaller = new HaplotypeCaller with HC_Arguments
                    // All required input
                    haplotypeCaller.input_file :+= bamFile
                    haplotypeCaller.out = swapExt(bamFile, "bam", "g.vcf")
                    // gVCF settings
                    haplotypeCaller.emitRefConfidence = ReferenceConfidenceMode.GVCF
            // Optional input
                    haplotypeCaller.sample_ploidy = samplePloidy
                    //add function to queue

    And the current error message:

    INFO  18:48:10,227 QScriptManager - Compiling 1 QScript
    INFO  18:48:11,660 QScriptManager - Compilation complete
    ##### ERROR --
    ##### ERROR stack trace
    Argument with name '--numscatters' (-nsc) is missing.
    Argument with name '--numcputhreads' (-nct) is missing.
    Argument with name '--outputfilename' (-O) is missing.
    Argument with name '--maxmem' (-mem) is missing.
    Argument with name '--mbq' (-min_base_quality_score) is missing.
            at org.broadinstitute.gatk.utils.commandline.ParsingEngine.validate(ParsingEngine.java:300)
            at org.broadinstitute.gatk.utils.commandline.ParsingEngine.validate(ParsingEngine.java:280)
            at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:225)
            at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
            at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61)
            at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
    ##### ERROR
    ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: Argument with name '--numscatters' (-nsc) is missing.
    ##### ERROR Argument with name '--numcputhreads' (-nct) is missing.
    ##### ERROR Argument with name '--outputfilename' (-O) is missing.
    ##### ERROR Argument with name '--maxmem' (-mem) is missing.
    ##### ERROR Argument with name '--mbq' (-min_base_quality_score) is missing.
    ##### ERROR ------------------------------------------------------------------------------------------
    INFO  18:48:11,721 QCommandLine - Shutting down jobs. Please wait...
    /var/spool/PBS/mom_priv/jobs/724827.sched01.SC: line 22: -mbq: command not found

    I run the script as follows (on a cluster - I just excluded the part that talks to the scheduler, PBSPro):

    java -Xmx40g -jar Queue.jar \
            -S Qscript_HC.scala \
            -R /mnt/lustre/users/djager/buf_clean/alignment_files/bam_sorted/gatk/GATK_Deon/refs/buffalo.final.fa \
            -I /mnt/lustre/users/djager/buf_clean/alignment_files/bam_sorted/gatk/GATK_Deon/M_47_14_aln-PE_sorted_dups_marked.bam \
            -stand_call_conf 30
            -mbq 20

    Any ideas?

  • Wait, I changed the java command to include the commands it said it was missing:

    java -Xmx40g -jar Queue.jar \
            -S Qscript_HC.scala \
            -R /mnt/lustre/users/djager/buf_clean/alignment_files/bam_sorted/gatk/GATK_Deon/refs/buffalo.final.fa \
            -I /mnt/lustre/users/djager/buf_clean/alignment_files/bam_sorted/gatk/GATK_Deon/M_47_14_aln-PE_sorted_dups_marked.bam \
            -stand_call_conf 30 \
            --mbq 20 \
            -nsc 24 \
            -nct 24 \
            -O M_47_14_aln-PE_sorted_dups_marked.bam \
            -mem 40 \

    The Qscript was left unchanged from the one above and it says it runs and it does run for about 1 minute and then says it is done, but then it hasn't created any output files except a jobreport file.

    I've attached the error log file output file.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    Accepted Answer


    The reference genome currently consists of 442 402 scaffolds and contigs.

    There are too many contigs in the reference; GATK tools are not designed to deal with so many contigs. You can try stitching together the contigs using Ns. If you do a forum search, you will find more threads on how to do this.


  • DdJ_SADdJ_SA Member
    Accepted Answer


    Thank you! I will try that :)

    P.S. If anyone runs into a similar issue, I found this tool (https://bitbucket.org/dholab/scaffoldstitcher/src) via @ameliahaj on this thread https://gatkforums.broadinstitute.org/gatk/discussion/4774/snp-calling-using-pooled-rna-seq-data to join scaffolds/ contigs together. Looks super useful! Thanks @ameliahaj :)


  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    Hi Deon,

    Thanks for posting a solution too :smiley:


Sign In or Register to comment.