
GATK 3.7 and GATK 4 beta2

Dear team,

I am using GATK 4 beta2 to test HaplotypeCaller in our NGS workflow.

The command I used is:

time -p /gpfs/software/genomics/GATK/4b.2/gatk/gatk-launch HaplotypeCaller \
--reference /gpfs/data_jrnas1/ref_data/Hsapiens/hs37d5/hs37d5.fa \
--input NA12892.recal.bam \
--dbsnp /gpfs/data_jrnas1/ref_data/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz \
--emitRefConfidence GVCF \
--readValidationStringency LENIENT \
--nativePairHmmThreads 32 \
--createOutputVariantIndex true \
--output NA12892.raw.snps.indels.g.vcf

The execution time for GATK 4 beta2 was 51 hours 32 minutes.

For comparison, I ran the same sample (NA12892) with GATK 3.7 using the following command:

time -p java -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xmx128g \
-jar /gpfs/software/genomics/GATK/3.7/base/GenomeAnalysisTK.jar -T HaplotypeCaller \
-nct 8 -pairHMM VECTOR_LOGLESS_CACHING \
-R /gpfs/data_jrnas1/ref_data/Hsapiens/hs37d5/hs37d5.fa \
-I NA12892.realigned.recal.bam \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
--dbsnp /gpfs/data_jrnas1/ref_data/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz \
-o NA12892.raw.snps.indels.g.vcf

The execution time for GATK 3.7 was 18 hours 12 minutes.

I don't know how to use multithreading (e.g. -nct) in GATK 4 to reduce the execution time on a single node; we have 32 cores per node with 512GB of memory available for benchmarking. To parallelize the GATK 4 workload, I also tried the Spark version.

I ran the GATK 4 beta2 Spark job on a cluster of 32 nodes (32 nodes x 32 cores, totaling 1024 cores). The execution time was almost the same as the non-Spark GATK 4 beta2 run (50 hours 21 minutes).

Could you please advise on how to reduce the execution time of the GATK 4 beta2 HaplotypeCaller?

Please see the Spark logs below:

  • /gpfs/software/spark/spark-2.1.0-bin-hadoop2.7//bin/spark-submit --master spark://nsnode11:6311 --driver-java-options -Dsamjdk.use_async_io_read_samtools=false,-Dsamjdk.use_async_io_write_samtools=true,-Dsamjdk.use_async_io_write_tribble=false,-Dsamjdk.compression_level=1 --conf spark.io.compression.codec=snappy --conf spark.yarn.executor.memoryOverhead=6000 --conf spark.kryoserializer.buffer.max=512m --conf spark.driver.userClassPathFirst=true --conf spark.driver.maxResultSize=0 --conf spark.executor.cores=1024 --conf spark.reducer.maxSizeInFlight=100m --conf spark.shuffle.file.buffer=512k --conf spark.akka.frameSize=512 --conf spark.akka.threads=10 --conf spark.executor.memory=50g --conf spark.driver.memory=150g --conf spark.local.dir=/gpfs/projects/NAGA/naga/NGS/pipeline/GATK_Best_Practices/GATK4b2Spark/1024cores/tmp --class org.broadinstitute.hellbender.Main /gpfs/software/genomics/GATK/4b.2/gatk/build/libs/hellbender-spark.jar HaplotypeCaller --reference /gpfs/data_jrnas1/ref_data/Hsapiens/hs37d5/hs37d5.fa --input /gpfs/projects/NAGA/naga/NGS/pipeline/GATK_Best_Practices/GATK4b2/bam//NA12892.recal.bam --dbsnp /gpfs/projects/NAGA/naga/SparkTest/SPARKCALLER/REF/dbsnp_138.vcf --emitRefConfidence GVCF --readValidationStringency LENIENT --nativePairHmmThreads 1024 --createOutputVariantIndex true --output NA12892.raw.snps.indels.g.vcf
    [August 9, 2017 10:13:02 AM AST] HaplotypeCaller --nativePairHmmThreads 1024 --dbsnp /gpfs/projects/NAGA/naga/SparkTest/SPARKCALLER/REF/dbsnp_138.vcf --emitRefConfidence GVCF --output NA12892.raw.snps.indels.g.vcf --input /gpfs/projects/NAGA/naga/NGS/pipeline/GATK_Best_Practices/GATK4b2/bam//NA12892.recal.bam --readValidationStringency LENIENT --reference /gpfs/data_jrnas1/ref_data/Hsapiens/hs37d5/hs37d5.fa --createOutputVariantIndex true --group StandardAnnotation --group StandardHCAnnotation --GVCFGQBands 1 --GVCFGQBands 2 --GVCFGQBands 3 --GVCFGQBands 4 --GVCFGQBands 5 --GVCFGQBands 6 --GVCFGQBands 7 --GVCFGQBands 8 --GVCFGQBands 9 --GVCFGQBands 10 --GVCFGQBands 11 --GVCFGQBands 12 --GVCFGQBands 13 --GVCFGQBands 14 --GVCFGQBands 15 --GVCFGQBands 16 --GVCFGQBands 17 --GVCFGQBands 18 --GVCFGQBands 19 --GVCFGQBands 20 --GVCFGQBands 21 --GVCFGQBands 22 --GVCFGQBands 23 --GVCFGQBands 24 --GVCFGQBands 25 --GVCFGQBands 26 --GVCFGQBands 27 --GVCFGQBands 28 --GVCFGQBands 29 --GVCFGQBands 30 --GVCFGQBands 31 --GVCFGQBands 32 --GVCFGQBands 33 --GVCFGQBands 34 --GVCFGQBands 35 --GVCFGQBands 36 --GVCFGQBands 37 --GVCFGQBands 38 --GVCFGQBands 39 --GVCFGQBands 40 --GVCFGQBands 41 --GVCFGQBands 42 --GVCFGQBands 43 --GVCFGQBands 44 --GVCFGQBands 45 --GVCFGQBands 46 --GVCFGQBands 47 --GVCFGQBands 48 --GVCFGQBands 49 --GVCFGQBands 50 --GVCFGQBands 51 --GVCFGQBands 52 --GVCFGQBands 53 --GVCFGQBands 54 --GVCFGQBands 55 --GVCFGQBands 56 --GVCFGQBands 57 --GVCFGQBands 58 --GVCFGQBands 59 --GVCFGQBands 60 --GVCFGQBands 70 --GVCFGQBands 80 --GVCFGQBands 90 --GVCFGQBands 99 --indelSizeToEliminateInRefModel 10 --useAllelesTrigger false --dontTrimActiveRegions false --maxDiscARExtension 25 --maxGGAARExtension 300 --paddingAroundIndels 150 --paddingAroundSNPs 20 --kmerSize 10 --kmerSize 25 --dontIncreaseKmerSizesForCycles false --allowNonUniqueKmersInRef false --numPruningSamples 1 --recoverDanglingHeads false --doNotRecoverDanglingBranches false --minDanglingBranchLength 4 --consensus false --maxNumHaplotypesInPopulation 128 --errorCorrectKmers false --minPruning 2 --debugGraphTransformations false --kmerLengthForReadErrorCorrection 25 --minObservationsForKmerToBeSolid 20 --likelihoodCalculationEngine PairHMM --base_quality_score_threshold 18 --gcpHMM 10 --pair_hmm_implementation FASTEST_AVAILABLE --pcr_indel_model CONSERVATIVE --phredScaledGlobalReadMismappingRate 45 --useDoublePrecision false --debug false --useFilteredReadsForAnnotations false --bamWriterType CALLED_HAPLOTYPES --disableOptimizations false --justDetermineActiveRegions false --dontGenotype false --dontUseSoftClippedBases false --captureAssemblyFailureBAM false --errorCorrectReads false --doNotRunPhysicalPhasing false --min_base_quality_score 10 --useNewAFCalculator false --annotateNDA false --heterozygosity 0.001 --indel_heterozygosity 1.25E-4 --heterozygosity_stdev 0.01 --standard_min_confidence_threshold_for_calling 10.0 --max_alternate_alleles 6 --max_genotype_count 1024 --sample_ploidy 2 --genotyping_mode DISCOVERY --contamination_fraction_to_filter 0.0 --output_mode EMIT_VARIANTS_ONLY --allSitePLs false --readShardSize 5000 --readShardPadding 100 --minAssemblyRegionSize 50 --maxAssemblyRegionSize 300 --assemblyRegionPadding 100 --maxReadsPerAlignmentStart 50 --activeProbabilityThreshold 0.002 --maxProbPropagationDistance 50 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false 
--createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false --minimumMappingQuality 20
    [August 9, 2017 10:13:02 AM AST] Executing as nkathiresan@nsnode11 on Linux 3.10.0-229.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Version: 4.beta.2-14-g4229219-SNAPSHOT
    [INFO] Available threads: 32
    [INFO] Requested threads: 1024
    [WARNING] Using 32 available threads, but 1024 were requested
    log4j:WARN No appenders could be found for logger (org.broadinstitute.hellbender.utils.MathUtils$Log10Cache).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    [August 11, 2017 12:34:22 PM AST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 3,021.34 minutes.
    Runtime.totalMemory()=57773916160

  • /gpfs/software/spark/spark-2.1.0-bin-hadoop2.7//sbin/stop-master.sh

Thanks a lot,
With Regards,
Naga

Issue · GitHub (filed by Sheila): #3631, closed by droazen.

Comments

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @naga
    Hi Naga,

    Can you confirm this happens with the latest beta release?

    Thanks,
    Sheila

  • naga (Qatar; Member)

    Hi Sheila,

    Yes, I used GATK4 Beta5 for our testing.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @naga
    Hi Naga,

    Alright, thanks for confirming. Let me ask the team and get back to you.

    -Sheila

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @naga
    Hi Naga,

    I heard back from a developer, and he has some questions/tips which I will relay here.

    1) Do you have the log (stdout + stderr) for the GATK4 non-Spark run? Can you post it? The developers need to know which PairHMM implementation you are running with.

    2) What sort of hardware are you running on? Specifically, is it an Intel machine with support for AVX?

    3) A good setting for --nativePairHmmThreads is probably 4-8; you won't see any improvement beyond that.

    4) You are setting -XX:+UseParallelGC -XX:ParallelGCThreads=32 for the GATK3 run. You would be better off setting it to 2-4 threads; performance typically gets worse beyond that, from what the developers have seen. You can set the same thing in GATK4 using --javaOptions '-XX:+UseParallelGC -XX:ParallelGCThreads=4' (see the sketch after this list). This will also give us a better comparison.

    5) In general, you want executors with ~4-8 cores and at least 4g of memory per core. How much memory do your nodes have, and are you running with autoscaling turned on? I suspect you are only allocating 1 executor on 1 node, and it is thrashing memory because it is trying to run 32 threads at once. Spark tuning for HaplotypeCaller is going to be complicated; the developers don't know how to do it well yet, but they will be working on it in the next few months. It looks like you are running with Spark 2.1.0. GATK4 currently requires Spark 2.0.2 (unfortunately a specific version), but the developers are planning to upgrade to Spark 2.2+ in the next few months.
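
    For example, a single-node non-Spark run combining points 3 and 4 might look something like the sketch below. This is only a sketch: it reuses the paths from your original command, the thread counts are illustrative starting points rather than tuned values, and --javaOptions is passed here to the gatk-launch wrapper before the tool name (adjust the placement if your wrapper version expects it elsewhere).

    # single-node GATK4 beta run: 4 parallel GC threads, 4 native PairHMM threads
    time -p /gpfs/software/genomics/GATK/4b.2/gatk/gatk-launch --javaOptions '-XX:+UseParallelGC -XX:ParallelGCThreads=4' HaplotypeCaller \
    --reference /gpfs/data_jrnas1/ref_data/Hsapiens/hs37d5/hs37d5.fa \
    --input NA12892.recal.bam \
    --dbsnp /gpfs/data_jrnas1/ref_data/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz \
    --emitRefConfidence GVCF \
    --nativePairHmmThreads 4 \
    --createOutputVariantIndex true \
    --output NA12892.raw.snps.indels.g.vcf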

    To be clear, the results will not be the same between GATK3, GATK4, and GATK4-Spark yet. GATK4 is in a rapid state of flux and has known performance issues that the developers are planning to work on soon.

    -Sheila

  • shlee (Cambridge; Member, Broadie, Moderator)

    Just to follow up on your Spark runs, @naga: I've outlined some basic considerations for setting Spark parameters in https://gatkforums.broadinstitute.org/gatk/discussion/10060/how-to-run-flagstatspark-on-a-cloud-spark-cluster, in case you want to experiment with the settings yourself. I developed and wrote that example tutorial back in July, so some of its elements may already be outdated.

    Also, you may be interested in my post today in this thread. Certain default settings have changed between GATK3.7 and GATK4, namely the inflater/deflater. These will impact performance and runtimes.
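
    As a rough starting point for the executor sizing Sheila describes in point 5 above, the spark-submit could request many small executors instead of one oversubscribed 1024-core executor. The values below are purely illustrative, not tuned: spark.executor.cores and spark.executor.memory are standard Spark properties, and the master URL, jar, and paths are taken from your earlier log.

    /gpfs/software/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-submit \
      --master spark://nsnode11:6311 \
      --conf spark.executor.cores=4 \
      --conf spark.executor.memory=16g \
      --conf spark.driver.memory=32g \
      --class org.broadinstitute.hellbender.Main \
      /gpfs/software/genomics/GATK/4b.2/gatk/build/libs/hellbender-spark.jar \
      HaplotypeCaller \
      --reference /gpfs/data_jrnas1/ref_data/Hsapiens/hs37d5/hs37d5.fa \
      --input NA12892.recal.bam \
      --emitRefConfidence GVCF \
      --output NA12892.raw.snps.indels.g.vcf

    With 32-core nodes, sizing like this lets the scheduler place several 4-core/16g executors per node rather than a single one running 32 threads at once; keep in mind Sheila's note that this GATK4 beta officially targets Spark 2.0.2.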

  • naga (Qatar; Member)

    @Sheila

    Hello Sheila,
    Apologies for the delayed response. Many thanks for helping me with the GATK4 optimization.
    1. The STDOUT and STDERR logs are attached. We also repeated the test with GATK4 beta5 and observed a performance improvement of about 12% (all the log files are attached). We have yet to try the GATK 4.0 release version.
    2. We are using Intel Sandy Bridge (4 sockets, 8 cores per socket) and Intel Haswell (2 sockets, 16 cores per socket) processors.
    3. We used --nativePairHmmThreads=32.
    4. I will redo the experiment with 4 parallel GC threads. However, my observation is that 32 threads work well for the PairHMM library. Please see the screenshot; this is for a small dataset that runs in approximately 3 hours.
    5. All our nodes have 256GB of memory (8GB per core). I will redo the Spark experiment, and I will take input from you and @shlee for this repeated test.

    Once again, many thanks for helping me with this performance debugging.

  • Sheila (Broad Institute; Member, Broadie, Moderator)
    edited January 31

    @naga
    Hi Naga,

    Thanks for getting back to us. I will relay this to the team. Please do let us know the results with the proper GATK4 release.

    -Sheila

    EDIT: The team has requested that you check with the GATK4 release (or 4.0.1.0), as there were major performance improvements to the HaplotypeCaller just before the release.

  • naga (Qatar; Member)

    @Sheila,

    Hi Sheila,

    I repeated the experiment with the GATK 4.0.0 release. The performance is much better than with GATK4 beta5. Here are the logs:

    $ tail -400 NA12892.HaplotypeCaller.err
    Using GATK jar /gpfs/software/genomics/GATK/4.0.0/gatk-package-4.0.0.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -jar /gpfs/software/genomics/GATK/4.0.0/gatk-package-4.0.0.0-local.jar HaplotypeCaller --reference /gpfs/data_jrnas1/ref_data/Hsapiens/hs37d5/hs37d5.fa --input /gpfs/projects/NAGA/naga/NGS/pipeline/GATK_Best_Practices/GATK4.0.0/NA12892/bam/NA12892.recal.bam --dbsnp /gpfs/data_jrnas1/ref_data/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz --emit-ref-confidence GVCF --read-validation-stringency LENIENT --native-pair-hmm-threads 32 --output /gpfs/projects/NAGA/naga/NGS/pipeline/GATK_Best_Practices/GATK4.0.0/NA12892/vcf/NA12892.raw.snps.indels.g.vcf
    [January 26, 2018 1:09:58 AM AST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 2,133.48 minutes.
    Runtime.totalMemory()=2183659520
    real 128010.56
    user 436969.62
    sys 3030.18

    Thanks and Regards,
    Naga

    Issue · GitHub (filed by Sheila): #4361, open.
  • Sheila (Broad Institute; Member, Broadie, Moderator)
    edited February 6

    @naga
    Hi Naga,

    Wonderful news. I am assuming this issue can be closed.

    -Sheila

    EDIT: What kind of data are you running on? Exome or genome? How many samples? Also, do you see more speedup with the Spark version? Thanks

  • naga (Qatar; Member)

    @Sheila
    Hi Sheila,
    Many thanks for your continuous support and help.
    It looks like the GATK3.7 HaplotypeCaller execution time is still better (18.21 hours). I used num_cpu_threads_per_data_thread=8 (HaplotypeCaller -nct 8) in GATK3.7, whereas I don't have any multi-threading options in GATK4.0. Additionally, I submitted the job through LSF and a 32-core node was used. I can see that 32 IntelPairHmm threads are executed; the log is here for your reference:

    13:36:41.128 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
    13:36:41.630 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/gpfs/software/genomics/GATK/4.0.0/gatk-package-4.0.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
    13:36:41.647 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/gpfs/software/genomics/GATK/4.0.0/gatk-package-4.0.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
    13:36:41.724 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
    13:36:41.724 INFO IntelPairHmm - Available threads: 32
    13:36:41.724 INFO IntelPairHmm - Requested threads: 32
    13:36:41.724 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation

    As we know, GATK does not scale well to larger numbers of threads. Is there an option to reduce the number of IntelPairHmm threads in GATK4.0? Please advise.

    I am using the Platinum Genomes data (WGS; a set of high-confidence variant calls from NA12878, NA12891 and NA12892). We are setting up a Spark cluster and will update you with the results once it is done.

    Thanks and Regards,
    Naga

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @naga
    Hi Naga,

    I have some things from the developer for you to test.

    1) Run both GATK 3.7 and GATK 4.0.1.1 (latest release) on the same machine, one right after another, on chromosome 20 only (using -L in both cases), and ensure that there are no other expensive processes running on this machine during the tests. Run each version 3 times, and take the average of the results.
    2) Add -pairHMM AVX_LOGLESS_CACHING to both the GATK3 and GATK4 command lines, to guarantee that the native PairHMM will be used in both cases.
    3) Get rid of the --native-pair-hmm-threads 32 in the GATK 4 command line. Too many threads can sometimes make performance worse by introducing too much contention.
    4) Check both the GATK3 and GATK4 output to ensure that the Intel inflater and deflater were used in both cases.
    5) Check both the GATK3 and GATK4 command lines to be sure they are equivalent (e.g., if one is running with -ERC GVCF, the other one should as well).
    6) Compute wall-clock time by running the Unix time command, if you are not already doing so (e.g., time ./gatk HaplotypeCaller ...); a sketch of the full comparison follows this list.
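
    Put together, the two timed chromosome-20 runs could look roughly like the sketch below. This is illustrative only: the reference and BAM paths are reused from your earlier commands, while the -Xmx values and output file names are placeholders, not recommendations.

    # GATK 3.7: chr20 only, native vectorized PairHMM (3.7's equivalent of AVX_LOGLESS_CACHING),
    # wall-clock timed; omit or keep -nct depending on whether you want your production GATK3 settings
    time java -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xmx32g \
      -jar /gpfs/software/genomics/GATK/3.7/base/GenomeAnalysisTK.jar -T HaplotypeCaller \
      -R /gpfs/data_jrnas1/ref_data/Hsapiens/hs37d5/hs37d5.fa \
      -I NA12892.realigned.recal.bam \
      -L 20 -pairHMM VECTOR_LOGLESS_CACHING -ERC GVCF \
      -o NA12892.chr20.gatk3.g.vcf

    # GATK 4.0.1.1: same interval, native AVX PairHMM, default native thread count
    time ./gatk --java-options '-XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xmx32g' HaplotypeCaller \
      -R /gpfs/data_jrnas1/ref_data/Hsapiens/hs37d5/hs37d5.fa \
      -I NA12892.recal.bam \
      -L 20 --pair-hmm-implementation AVX_LOGLESS_CACHING -ERC GVCF \
      -O NA12892.chr20.gatk4.g.vcf

    Repeating each run three times and averaging the wall-clock times should give a fair like-for-like comparison, per point 1.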

    Thanks,
    Sheila
