
How do I limit thread usage (specifically for SortSam, but I may need this more generally)?

jdenvir (Marshall University, Member)

I am trying to use GATK 4.1.3.0 in a fairly "old-school" environment to process some whole-exome sequencing reads and ultimately call variants. The environment is a single server node with reasonably large RAM (768GB) and 64 CPUs. I have 61 samples which were split over multiple lanes (and runs, in some cases), and consequently I have 238 bam files after aligning each pair of fastqs with BWA. The plan is to combine these bams into a single bam per sample at the step when I mark duplicate reads.

Before that, I need to sort the bam files, which I'm trying to do with gatk SortSam. I'm doing this with a simple shell script which batches these into groups of 34 and runs 34 processes in parallel. I'm limiting the memory per process so I am well within the limits of the box. Essentially this looks like:

GATK=/opt/gatk-4.1.3.0/gatk
MAX_JVMS=34 # 238 bam files gives seven batches
MAX_MEM=8g

# original bam files are in subfolders (one per run) of aligned:
source_dir=aligned
dest_dir=sorted

# recreate same directory structure under destination directory:
for run_dir in "${source_dir}"/* ; do
  IFS='/' read -r src run <<< "$run_dir"
  mkdir -p "${dest_dir}/${run}"
done

source_files=(${source_dir}/*/*.bam)
num_files=${#source_files[@]}

start=0

while [ "$start" -lt "$num_files" ] ; do
  # slice the array so we only run $MAX_JVMS processes at a time:
  for f in "${source_files[@]:start:MAX_JVMS}" ; do
    echo "Sorting $f"
    $GATK --java-options -Xmx${MAX_MEM} SortSam -I "$f" -O "${f/$source_dir/$dest_dir}" --SORT_ORDER coordinate --CREATE_INDEX &
  done
  wait  # block until the whole batch finishes before launching the next
  start=$(( start + MAX_JVMS ))
done

I can verify that at most 34 of these are running at once, and that memory consumption is not an issue. The problem is that each instance of GATK is creating multiple threads, and consequently I am running into thread starvation; I'm seeing errors of the form

Exception in thread "ForkJoinPool.commonPool-worker-27" java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at java.util.concurrent.ForkJoinPool.createWorker(ForkJoinPool.java:1486)
    at java.util.concurrent.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1517)
    at java.util.concurrent.ForkJoinPool.signalWork(ForkJoinPool.java:1634)
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1733)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1691)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

in the logs (with the corresponding instance of gatk failing). Again, this isn't a heap memory allocation error; it's a native thread allocation error. As far as I can tell, each instance of GATK seems to assume it has all 64 CPUs to play with, and is trying to allocate threads accordingly.
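
In case it helps, this is how I'm confirming the per-process thread counts (plain Linux tooling, nothing GATK-specific; the pgrep pattern just matches the local jar name from my install):

# count the native threads owned by each running GATK JVM
# (pattern matches the local jar name; adjust for your install)
for pid in $(pgrep -f gatk-package); do
  echo "PID $pid: $(grep '^Threads:' /proc/$pid/status)"
done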

I can't find any options to limit the number of threads for each individual process. I'm looking for something equivalent to the -@ <threads> option in samtools, etc.
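
For comparison, this is the kind of cap I mean (real samtools usage, shown purely to illustrate the interface I'm looking for):

# samtools caps its worker threads explicitly:
samtools sort -@ 4 -o sorted.bam aligned.bam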

While this is currently specific to SortSam, I anticipate I'm going to need similar functionality throughout the pipeline.

I'm aware that the latest version of GATK is really aimed at somewhat different architectures (HPC clusters where each process effectively has its own node on which to run, either as a standalone cluster or one which is cloud-based); however this is the environment in which I am currently constrained to run. I was able to make these pipelines work with GATK3 (and earlier) in this environment, but haven't been able to do so with GATK4.

Answers

  • bshifaw (Member, Broadie, Moderator, admin)

    Hi @jdenvir ,

    Try reviewing the following article, and possibly implementing some of the other useful Java options linked here.
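
    For instance, one knob sometimes suggested when running many JVMs on one box (illustrative only; not necessarily what the linked article recommends) is the serial collector:

    # run GC single-threaded so each of the 34 concurrent JVMs doesn't
    # spawn a garbage-collection worker per core (illustrative sketch):
    gatk --java-options "-Xmx8g -XX:+UseSerialGC" SortSam \
        -I aligned.bam -O sorted.bam --SORT_ORDER coordinate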

  • jdenvir (Marshall University, Member)

    Hi @bshifaw ,

    Thanks for the response. I tried this using variations of the following:

    gatk SortSamSpark -I aligned.bam -O sorted.bam -SO queryname --num-executors 1 --executor-cores 4
    

    which complains that --num-executors and --executor-cores are not valid options.

    Using, e.g.

    gatk SortSamSpark -I aligned.bam -O sorted.bam -SO queryname --conf 'spark.executor.cores=4' 
    

    runs, but shows the same behavior, i.e. it just grabs as many threads as possible (judging by CPU usage). This is the same with or without the -XX:ConcGCThreads option to the JVM. (It is impressively fast, though...)

    I see the same thing with MarkDuplicatesSpark. Am I missing how to use these properties correctly?

    Thanks,
    Jim

  • bshifaw (Member, Broadie, Moderator, admin)

    I'll check with the dev team and get back to you.

  • jdenvir (Marshall University, Member)

    Thanks @bshifaw

    In case they need version info, etc:

    $ /opt/gatk-4.1.3.0/gatk --version
    Using GATK jar /opt/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar --version
    The Genome Analysis Toolkit (GATK) v4.1.3.0
    HTSJDK Version: 2.20.1
    Picard Version: 2.20.5
    
    $ java -version
    openjdk version "1.8.0_222"
    OpenJDK Runtime Environment (build 1.8.0_222-b10)
    OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
    
    $ uname -a
    Linux htseq.marshall.edu 3.10.0-957.12.1.el7.x86_64 #1 SMP Tue Apr 23 13:49:21 CDT 2019 x86_64 x86_64 x86_64 GNU/Linux
    
    $ cat /etc/os-release 
    NAME="Scientific Linux"
    VERSION="7.6 (Nitrogen)"
    ID="scientific"
    ID_LIKE="rhel centos fedora"
    VERSION_ID="7.6"
    PRETTY_NAME="Scientific Linux 7.6 (Nitrogen)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:scientificlinux:scientificlinux:7.6:GA"
    HOME_URL="http://www.scientificlinux.org//"
    BUG_REPORT_URL="mailto:[email protected]"
    
    REDHAT_BUGZILLA_PRODUCT="Scientific Linux 7"
    REDHAT_BUGZILLA_PRODUCT_VERSION=7.6
    REDHAT_SUPPORT_PRODUCT="Scientific Linux"
    REDHAT_SUPPORT_PRODUCT_VERSION="7.6"
    
  • bshifaw (Member, Broadie, Moderator, admin)

    @jdenvir

    1. The team suggested double-checking the memory while you're running the command:
      free -h --si -s 5 > memory.txt and run it in the background with &

    2. Try -XXgcThreads as a Java option:

    Format: -XXgcthreads:<# threads>
    

    In order to use --num-executors and --executor-cores you would have to set up a master node locally and use the following parameter:
    --spark-master local[<number of cores>]
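
    For example (a sketch using that parameter; the file names are placeholders):

    # run the Spark tool standalone, capped at 4 local cores:
    gatk SortSamSpark -I aligned.bam -O sorted.bam -SO queryname \
        --spark-master 'local[4]'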

  • jdenvir (Marshall University, Member)

    Thanks @bshifaw

    Memory is definitely not the issue; nor is the GC consuming anything like that many threads.

    In order to use --num-executors and --executor-cores you would have to setup a master node locally

    So does this mean there's no way to control thread allocation without setting up a local Spark framework? That seems like a pretty huge dependency for fairly basic functionality. All the built-in Java executor services (e.g. ForkJoinPool) allow a level of parallelism to be specified completely independently of the environment.

  • LouisB (Broad Institute, Member, Broadie, Dev) ✭✭

    Hi @jdenvir. I'm one of the GATK devs.

    I wanted to step in and help clarify the situation with Spark core usage. The --num-executors and --executor-cores Spark arguments are only relevant if you're running a Spark cluster. The way you are running Spark is as a standalone Spark execution within a single process. The way to control the number of cores in that case is to specify the Spark master as --spark-master 'local[4]', where the number in brackets [] is the number of cores you want. The default is 'local[*]', which means all available cores; that explains what you're seeing.

    That's why GATK doesn't accept --num-executors when running in local mode. (You can, as you discovered, pass them as Spark arguments, but they just get ignored.) It's admittedly confusing and could probably be explained better. There is some documentation about it here for future reference.
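
    To make the contrast concrete (a sketch; the file names are placeholders):

    # ignored in local mode (executor settings only apply on a cluster):
    gatk SortSamSpark -I aligned.bam -O sorted.bam -SO queryname \
        --conf 'spark.executor.cores=4'

    # honored in local mode: the core cap lives in the master string
    gatk SortSamSpark -I aligned.bam -O sorted.bam -SO queryname \
        --spark-master 'local[4]'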

    I'm not sure what's going on with the native-thread issue you originally started this thread about. I have a few things to try:
    1: GATK shouldn't use very many threads, with the exception of the garbage collection threads. However, garbage collection can allocate 1 thread per core per Java process, so it seems possible that restricting that could then allow other threads to be created if you're hitting some sort of native limit. Have you tried with -XXgcThreads set to something low, like 2 or 4? -XX:ConcGCThreads only restricts SOME of the garbage collection thread creation, so it's worth trying gcThreads (see the sketch after this list).

    2: Do you have a more complete stack trace that shows where the threads are being allocated when they fail? Maybe there is a GATK bug where we're accidentally allocating too many somewhere.

    3: Is it possible your machine has a low thread limit for some reason? Could you check cat /proc/sys/kernel/threads-max? Maybe poking around with some of the suggestions on changing default thread stack sizes in this Stack Overflow post could help?
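
    As mentioned in point 1, here is a sketch of those caps. (-XXgcThreads looks like the JRockit-era spelling; on the HotSpot/OpenJDK 8 build shown earlier in this thread, the corresponding flags are the -XX: options below.)

    # cap both GC worker pools per JVM (hedged sketch):
    gatk --java-options "-Xmx8g -XX:ParallelGCThreads=2 -XX:ConcGCThreads=2" \
        SortSam -I aligned.bam -O sorted.bam --SORT_ORDER coordinate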

    Let me know if you have any more information. I haven't ever heard of this happening before so I'm guessing there might be something somewhat unusual about your system setup.

  • jdenvirjdenvir Marshall UniversityMember

    @LouisB Thanks for the response, and for the clarification on the Spark-specific parameters. This is really just a placeholder to let you know I've seen this; I'll need a bit of time to dig out the information you asked for. I have worked around the problem for now, simply by running each sample in series instead of in parallel, using the (seriously impressively fast) SortSamSpark command. I do want to understand what's happening here, however, so I'll revisit, recreate the issue, and post back here when I get a chance.

    I'll try the gcThreads JVM option and regenerate the logs to see if there are more details in the stack trace. IIRC that was the complete trace for each instance of the exception, which suggests that the GC (or some other JVM process not specific to the GATK codebase) is the culprit. As a Java dev of some experience, though, it's difficult for me to imagine the GC being responsible for starting 50+ threads if the core application is only using a handful. (Monitoring in top shows the CPU usage of individual instances of GATK consuming 50+ CPUs for periods of the order of several minutes at a time.)

    It would also be somewhat paradoxical for the GC to throw OOMEs... though I suspect that exception type is spurious and is somehow just a placeholder for "the JVM tried to create too many threads". Anything is possible, though.

    Oh, cat /proc/sys/kernel/threads-max yields 6,189,701.
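
    Side note: since "unable to create new native thread" often traces to per-user limits rather than the kernel-wide ceiling, these might also be worth a look (standard Linux, nothing GATK-specific):

    ulimit -u                      # max user processes (threads count against this)
    cat /proc/sys/kernel/pid_max   # kernel-wide pid/tid ceiling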

  • LouisB (Broad Institute, Member, Broadie, Dev) ✭✭

    Thanks for the reply. I'm glad SortSamSpark is working for you.

    I've definitely seen weird pathological garbage collection issues on really big machines before, which is why I'm so focused on that. Things like a tiny single-core process spending 3000% CPU because garbage collection is contending with itself somehow.

    As a side note, if you need to mark duplicates, check out MarkDuplicatesSpark as well. It should be similarly fast on your setup. It ideally takes in queryname-sorted bams and outputs position-sorted bams, so it eliminates a separate sort step if the pipeline is aligned reads -> duplicate marking -> variant calling.
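
    A sketch of what that could look like for one of your multi-lane samples (the file names are placeholders):

    # merge the per-lane queryname-sorted bams for one sample and mark
    # duplicates; output comes back position-sorted, ready for variant calling:
    gatk MarkDuplicatesSpark \
        -I sample1_lane1.bam -I sample1_lane2.bam \
        -O sample1.markdup.bam \
        --spark-master 'local[8]'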
