How do I limit thread usage (specifically for SortSam, but I may need this more generally)

I am trying to use GATK 4.1.3.0 in a fairly "old-school" environment to process some whole-exome sequencing reads and ultimately call variants. The environment is a single server node with reasonably large RAM (768GB) and 64 CPUs. I have 61 samples which were split over multiple lanes (and runs, in some cases), and consequently I have 238 bam files after aligning each pair of fastqs with BWA. The plan is to combine these bams into a single bam per sample at the step when I mark duplicate reads.
Before that, I need to sort the bam files, which I'm trying to do with gatk SortSam. I'm doing this with a simple shell script which batches these into groups of 34 and runs 34 processes in parallel. I'm limiting the memory per process so I am well within the limits of the box. Essentially this looks like:
GATK=/opt/gatk-4.1.3.0/gatk
MAX_JVMS=34   # 238 bam files gives seven batches
MAX_MEM=8g

# original bam files are in subfolders (one per run) of aligned:
source_dir=aligned
dest_dir=sorted

# recreate same directory structure under destination directory:
for run_dir in ${source_dir}/* ; do
    IFS='/' read src run <<< "$run_dir"
    mkdir -p ${dest_dir}/$run
done

source_files=(${source_dir}/*/*.bam)
num_files=${#source_files[@]}
start=0
while [ $start -lt $num_files ] ; do
    # slice array so we only run $MAX_JVMS processes at a time:
    for f in ${source_files[@]:start:MAX_JVMS} ; do
        echo "Sorting $f"
        $GATK --java-options -Xmx${MAX_MEM} SortSam -I $f -O ${f/$source_dir/$dest_dir} \
            --SORT_ORDER coordinate --CREATE_INDEX &
    done
    wait
    start=$(( start + MAX_JVMS ))
done
I can verify that there are at most 34 of these running at once, and that memory consumption is not an issue. The problem is that each instance of GATK is creating multiple threads, and consequently I am running into thread starvation issues; I'm seeing errors of the form
Exception in thread "ForkJoinPool.commonPool-worker-27" java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at java.util.concurrent.ForkJoinPool.createWorker(ForkJoinPool.java:1486)
    at java.util.concurrent.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1517)
    at java.util.concurrent.ForkJoinPool.signalWork(ForkJoinPool.java:1634)
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1733)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1691)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
in the logs (with the corresponding instance of gatk failing). Again, this isn't a heap memory allocation error, it's a native thread allocation error. As far as I can tell, each instance of GATK seems to be assuming it has all 64 CPUs to play with, and is trying to allocate threads accordingly.
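A quick way to confirm the per-process thread counts (a generic procps sketch; the pgrep pattern is just an assumption about how the java processes appear on the command line):

# show pid, native thread count (nlwp), and command for each SortSam JVM:
ps -o pid=,nlwp=,comm= -p $(pgrep -d, -f SortSam)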
I can't find any options to limit the number of threads for each individual process. I'm looking for something equivalent to the -@ option in samtools, etc.
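For comparison, this is the kind of explicit cap I mean (an ordinary samtools example, nothing GATK-specific; file names are placeholders):

# samtools pins its worker-thread count explicitly with -@:
samtools sort -@ 4 -o sample.sorted.bam sample.bam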
While this is currently specific to SortSam, I anticipate I'm going to need similar functionality throughout the pipeline.
I'm aware that the latest version of GATK is really aimed at somewhat different architectures (HPC clusters where each process effectively has its own node on which to run, either as a standalone cluster or one which is cloud-based); however, this is the environment in which I'm currently constrained to run. I was able to make these pipelines work with GATK3 (and earlier) in this environment, but haven't been able to do so with GATK4.
Answers
Hi @jdenvir ,
Try reviewing the following article, and possibly also implementing some of the other useful Java options described here.
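For example, something along these lines (an illustrative invocation, not a tested recipe; ParallelGCThreads and ConcGCThreads are standard HotSpot flags rather than anything GATK-specific, and the file names are placeholders):

# cap the heap and the GC worker threads for each JVM:
gatk --java-options "-Xmx8g -XX:ParallelGCThreads=2 -XX:ConcGCThreads=2" \
    SortSam -I input.bam -O sorted.bam --SORT_ORDER coordinate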
Hi @bshifaw ,
Thanks for the response. I tried this using variations of the following:

which complains that --num-executors and --executor-cores are not valid options. Using, e.g.,

runs, but shows the same behavior, i.e. it just grabs as many threads as possible (judging by CPU usage). This is the same with or without the -XX:ConcGCThreads option to the JVM. (It is impressively fast, though...) I see the same thing with MarkDuplicatesSpark. Am I missing how to use these properties correctly?

Thanks,
Jim
I'll check with the dev team and get back to you.
Thanks @bshifaw
In case they need version info, etc:
The team suggested double-checking the memory while you're running the command:
free -h --si -s 5 > memory.txt
and running it in the background with &.
Try -XXgcThreads as a java option.

In order to use --num-executors and --executor-cores you would have to set up a master node locally and use the following parameter: --spark-master local[number of cores]
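For example (illustrative only; file names are placeholders):

# run the Spark tool locally, pinned to 4 cores:
gatk SortSamSpark -I input.bam -O sorted.bam --spark-master 'local[4]'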
Thanks @bshifaw
Memory is definitely not the issue; nor is the GC consuming anything like that many threads.
So does this mean there's no way to control thread allocation without setting up a local Spark framework? That seems like a pretty huge dependency for fairly basic functionality. All the built-in Java executor services (e.g. ForkJoinPool) allow a level of parallelism to be specified completely independently of the environment.
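For what it's worth, the JDK does let you cap the common pool globally with a system property; whether GATK's surplus threads actually come from the common pool is only my assumption from the stack trace, but something like this might be worth a try:

# cap the JVM-wide common ForkJoinPool at 4 worker threads:
gatk --java-options "-Xmx8g -Djava.util.concurrent.ForkJoinPool.common.parallelism=4" \
    SortSam -I input.bam -O sorted.bam --SORT_ORDER coordinate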
Hi @jdenvir. I'm one of the GATK devs.

I wanted to step in and help clarify the situation with spark core usage. The --num-executors and --executor-cores spark arguments are only relevant if you're running a spark cluster. The way you are running spark is as a standalone spark execution within a single process. The way to control the number of cores in that case is by specifying the spark master as --spark-master 'local[4]', where the number in brackets [] is the number of cores you want. The default is *, which is all available cores, which explains what you're seeing.

That's why GATK doesn't accept --num-executors when running in local mode. (You can, as you discovered, pass them as spark arguments, but they just get ignored then.) It's admittedly confusing and could probably be better explained. There is some documentation about it here for future reference.

I'm not sure what's going on with the original thread memory issue that started this thread. I have a few things to try:
1: GATK shouldn't use very many threads, with the exception of the garbage collection threads. However, garbage collection can allocate 1 thread per core per java process, so it seems possible that restricting that could then allow other threads to be created if you're hitting some sort of native limit. Have you tried with -XXgcThreads set to something low like 2 or 4? -XX:ConcGCThreads only restricts SOME of the garbage collection thread creation, so it's worth trying gcThreads.

2: Do you have a more complete stacktrace that shows where the threads are being allocated when they fail? Maybe there is a GATK bug where we're accidentally allocating too many somewhere.

3: Is it possible your machine has a low thread limit for some reason? Could you check cat /proc/sys/kernel/threads-max? Maybe poking around with some of the suggestions on changing thread default stack sizes in this stackoverflow post could help? (See the checks sketched after this list.)

Let me know if you have any more information. I haven't ever heard of this happening before so I'm guessing there might be something somewhat unusual about your system setup.
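A few quick checks along those lines (generic Linux diagnostics, not GATK-specific; on Linux each thread counts against the per-user process limit, so ulimit -u is often the limit that actually bites):

cat /proc/sys/kernel/threads-max   # system-wide ceiling on threads
ulimit -u                          # per-user process/thread limit (often the real culprit)
cat /proc/sys/vm/max_map_count     # each thread stack consumes memory mappings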
@LouisB Thanks for the response, and for the clarification on the spark-specific parameters. This is really just a placeholder to let you know I've seen this; I'll need a bit of time to dig out the information you asked for. I have worked around the problem for now, simply by running each sample in series instead of in parallel, using the (seriously impressively fast) SortSamSpark command. I do want to understand what's happening here, however, so I'll revisit, recreate the issue, and post back here when I get a chance.
I'll try the gcThreads JVM option and regenerate the logs to see if there are more details in the stack trace. IIRC that was the complete trace for each instance of the exception, which suggests that the GC (or some other JVM process not specific to the GATK codebase) is the culprit. As a Java dev of some experience, though, it's difficult for me to imagine the GC being responsible for starting 50+ threads if the core application is only using a handful. (Monitoring in top shows the CPU usage of individual instances of GATK consuming 50+ CPUs for periods of the order of several minutes at a time.)

It would also be somewhat paradoxical for the GC to throw OOMEs... though I suspect that exception type is spurious and is somehow just a placeholder for "the JVM tried to create too many threads". Anything is possible, though.
Oh, cat /proc/sys/kernel/threads-max yields 6,189,701.

Thanks for the reply. I'm glad SortSamSpark is working for you.
I've definitely seen weird pathological garbage collection issues on really big machines before, so that's why I'm so focused on that. Things like spending 3000% cpu on a tiny single-core process because garbage collection is contending with itself somehow.
As a side note, if you need to mark duplicates, check out MarkDuplicatesSpark as well. It should be similarly fast on your setup. It ideally takes in queryname-sorted bams and outputs position-sorted bams, so it eliminates a separate sort step if the pipeline is aligned reads -> duplicate marking -> variant calling.
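A sketch of what that might look like for one of your multi-lane samples (file names are placeholders; MarkDuplicatesSpark accepts multiple inputs, so the per-sample merge, the duplicate marking, and the sort all fall out of the one step):

# queryname-grouped bams from bwa go in; one duplicate-marked,
# coordinate-sorted bam per sample comes out:
gatk MarkDuplicatesSpark \
    -I sample1_lane1.bam -I sample1_lane2.bam \
    -O sample1.markdup.bam \
    -M sample1.markdup_metrics.txt \
    --spark-master 'local[4]'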