About Java threads in GATK

prasundutta87prasundutta87 EdinburghMember

Since GATK is based on Java, and the JVM spawns multiple threads for many GATK tools such as HaplotypeCaller, CombineGVCFs, GenotypeGVCFs and GenomicsDBImport, it helps to be able to control that thread spawning when computing resources are limited. In some GATK forum threads we have been advised to use -XX:ConcGCThreads and -XX:ParallelGCThreads. Using them properly requires an in-depth understanding of how Java multithreading works, so it is difficult for me to know when to use which. Can someone from the GATK development team explain which -XX option should be used?

Issue · Github
by Sheila

Issue Number: 3114
State: closed
Closed By: chandrans

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @prasundutta87
    Hi,

    I have to ask someone on the team to respond :smile: We will get back to you asap.

    -Sheila

  • prasundutta87prasundutta87 EdinburghMember
    edited June 2018

    Sure Sheila! Thanks a lot.

    There is a follow up question to this one.

    I have been looking through some HaplotypeCaller stderr output from versions 4.0.1.2 and 4.0.4.0, and I found a difference in the pairHMM threading messages.

    From 4.0.1.2:

    00:10:19.323 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/exports/eddie3_homes_local/s0928794/tools/gatk-package-4.0.1.2-local.jar!/com/intel/gkl/native/libgkl_utils.so
    00:10:19.325 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/exports/eddie3_homes_local/s0928794/tools/gatk-package-4.0.1.2-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
    00:10:19.380 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
    00:10:19.381 INFO IntelPairHmm - Available threads: 16
    00:10:19.381 INFO IntelPairHmm - Requested threads: 8
    00:10:19.381 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation

    From 4.0.4.0:

    15:51:47.997 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/gatk-4.0.4.0/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
    15:51:47.998 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/gatk-4.0.4.0/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
    15:51:48.047 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
    15:51:48.047 INFO IntelPairHmm - Available threads: 1
    15:51:48.048 INFO IntelPairHmm - Requested threads: 4
    15:51:48.048 WARN IntelPairHmm - Using 1 available threads, but 4 were requested
    15:51:48.048 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation

    You see what happened? This is related to this question: https://gatkforums.broadinstitute.org/gatk/discussion/11434/no-of-cores-utilization-in-haplotypcaller-in-gvcf-mode#latest

    In version 4.0.1.2, HaplotypeCaller (Java in particular) was accessing other cores on the node even though it had been assigned only 1 core on Grid Engine. In version 4.0.4.0 this problem does not seem to arise, and a specific warning tells us that only 1 thread is being used (because that is what was made available to the program) instead of the default pairHMM thread request, i.e. 4.

    It turns out that some improvement was made in the HaplotypeCaller code (I don't know exactly what, because I found no related update on GitHub). HaplotypeCaller (or Java itself) limited itself to 1 core when 1 core was assigned to it in Oracle Grid Engine. Could you elaborate on this improvement?

    Or should this question be cross-posted in the above mentioned link as well?
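
The two logs above are consistent with the native PairHMM capping the requested thread count at whatever the runtime reports as available. Below is a minimal sketch of that capping logic, written in Java purely for illustration. The class and method names are hypothetical and this is not the GKL source; the requested count is assumed to come from HaplotypeCaller's --native-pair-hmm-threads argument, and the message text simply mirrors the warning in the 4.0.4.0 log.

```java
// Hypothetical illustration of the thread capping implied by the IntelPairHmm
// log lines above. Not the actual GKL or GATK implementation.
class PairHmmThreadCap {

    // Use no more threads than the runtime reports as available (at least 1).
    static int effectiveThreads(int requested, int available) {
        if (requested < 1) {
            throw new IllegalArgumentException("requested must be >= 1");
        }
        return Math.min(requested, Math.max(available, 1));
    }

    // Build a message shaped like the warning seen in the 4.0.4.0 log.
    static String describe(int requested, int available) {
        int used = effectiveThreads(requested, available);
        if (used < requested) {
            return "Using " + used + " available threads, but " + requested + " were requested";
        }
        return "Using " + used + " requested threads";
    }

    public static void main(String[] args) {
        System.out.println(describe(8, 16)); // the 4.0.1.2 situation
        System.out.println(describe(4, 1));  // the 4.0.4.0 situation
    }
}
```

Under this reading, the difference between the two runs is the "available" number, which depends on what the scheduler exposes to the process rather than on the requested value.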

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @prasundutta87
    Hi,

    I just messaged the developer. I will let you know when he gets back to me.

    -Sheila

  • prasundutta87prasundutta87 EdinburghMember

    Sure, thank you. Has anyone been able to explain the difference between -XX:ConcGCThreads and -XX:ParallelGCThreads?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @prasundutta87
    Hi,

    I just pinged them again. One of us will get back to you soon.

    -Sheila

  • LouisBLouisB Broad InstituteMember, Broadie, Dev ✭✭

    I'm honestly not sure what changed. I don't remember any recent changes to HaplotypeCaller threading. Is it possible that something in Grid Engine changed so that it presents fewer cores to Java? I believe we just detect the number of cores that the system tells us are available and choose how many threads to use based on that. There are changes in Java 9 that make this sort of thing more robust in Docker, but I didn't think there were any changes in Java 8 or our codebase that would change the behavior here. My theory is that maybe there was an upgrade to Grid Engine?

    If Grid Engine is presenting the right number of cores, Java should make a reasonable decision about garbage collection threads. The issues occur when Java thinks there are many cores available on the machine, e.g. 48 cores for a server, and tries to use them all for GC even though we're running what is essentially a single-threaded application.
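
To make the "many cores, many GC threads" effect concrete, here is a rough sketch of how HotSpot in the Java 8 era picks its default GC thread counts from the number of CPUs it sees. The formulas are my understanding of the HotSpot defaults and may differ between JVM versions and vendors, so treat them as an approximation; the class and method names are hypothetical.

```java
// Sketch only: approximates HotSpot's default GC thread ergonomics (JDK 8 era).
// Class and method names are hypothetical, not part of GATK or the JDK.
class GcThreadDefaults {

    // Approximate default for -XX:ParallelGCThreads:
    // every CPU up to 8, then 5/8 of each CPU beyond 8.
    static int defaultParallelGcThreads(int cpus) {
        return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
    }

    // Approximate default for -XX:ConcGCThreads with G1:
    // about a quarter of the parallel GC threads, but at least 1.
    static int defaultG1ConcGcThreads(int parallelGcThreads) {
        return Math.max((parallelGcThreads + 2) / 4, 1);
    }

    public static void main(String[] args) {
        // What the JVM believes it may use. If the scheduler does not restrict
        // the process, this is the full core count of the node.
        int cpus = Runtime.getRuntime().availableProcessors();
        int parallel = defaultParallelGcThreads(cpus);
        System.out.println("CPUs visible to the JVM: " + cpus);
        System.out.println("approx. default ParallelGCThreads: " + parallel);
        System.out.println("approx. default ConcGCThreads (G1): "
                + defaultG1ConcGcThreads(parallel));
    }
}
```

With these formulas, a JVM that sees 48 cores would start about 33 parallel GC threads by default, while one that sees a single core would start 1, which matches the point that the problem goes away once the scheduler presents the right number of cores.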

  • prasundutta87prasundutta87 EdinburghMember

    Thank you Sheila for sharing this.

  • manolismanolis Member ✭✭
    edited November 2018

    (GATK 4.0.11.0)

    Hi,

    I have access to a Linux server, without the possibility of using Spark or WDL, and when I run HaplotypeCaller and other GATK tools, each process uses around 70 threads!

    I need to limit the number of threads!

    openjdk version "1.8.0_121"
    OpenJDK Runtime Environment (Zulu 8.20.0.5-linux64) (build 1.8.0_121-b15)
    OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-linux64) (build 25.121-b15, mixed mode)

    and in HC I'm using these options:

    -XX:GCTimeLimit=50
    -XX:GCHeapFreeLimit=10
    -XX:ConcGCThreads=1
    -XX:ParallelGCThreads=1
    

    With these options I get around 30 threads...

    Can limiting the threads of GATK tools affect the results/output files, for example in GenomicsDBImport and others?

    If you have some time, could you please give me some advice?

    Many thanks

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @manolis

    Those are all Java options that affect garbage collection (hence the "GC" in their name), not the logical behavior of any GATK tool.

    If you want to try reducing the number of garbage collector threads, it will not affect the output of the program at all, because those options only affect how Java goes about freeing the memory of old variables and data that are no longer in use. The downside of decreasing them too much is that garbage collection may run more slowly and thus slow down processing.

    Regards
    Bhanu
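
On the earlier question of which -XX option applies: -XX:ParallelGCThreads sets the number of worker threads used during stop-the-world collection phases, while -XX:ConcGCThreads sets the threads used for concurrent phases, which only concurrent collectors such as CMS and G1 have. Java 8 normally defaults to the Parallel collector on server-class machines, in which case -XX:ConcGCThreads should have no effect. The sketch below is one way to check which collector and which values a given JVM actually ended up with. It assumes a HotSpot-based JVM (it relies on com.sun.management.HotSpotDiagnosticMXBean), and the class name is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Sketch only: reports the active collectors and GC thread settings of the
// running JVM. Assumes a HotSpot-based JVM; the class name is hypothetical.
class GcSettingsReport {

    // Names of the active collectors, e.g. "PS Scavenge" / "PS MarkSweep"
    // for the Parallel collector or "G1 Young Generation" for G1.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    // Current value of a HotSpot VM option, returned as a string.
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return hotspot.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("Collectors: " + collectorNames());
        System.out.println("ParallelGCThreads = " + vmOption("ParallelGCThreads"));
        System.out.println("ConcGCThreads = " + vmOption("ConcGCThreads"));
    }
}
```

Running this with the same -XX flags that are passed to GATK shows whether the flags were picked up. Note that a misspelled option name is usually rejected by the JVM at startup rather than silently ignored.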
