How can I use parallelism to make GATK tools run faster?

Geraldine_VdAuwera (Posts: 5,202; Administrator, GSA Member)
edited April 2013 in FAQs

This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.

Overview

As explained in the primer on parallelism for the GATK, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using Queue).

Multi-threading options

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:

  • -nt / --num_threads controls the number of data threads sent to the processor
  • -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread

For more information on how these multi-threading options work, please read the primer on parallelism for the GATK.
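Because -nct sets the number of CPU threads for each data thread, combining the two options multiplies them. Here is a minimal sketch of that arithmetic in Python. The function name is hypothetical, and treating the product as the total number of threads requested is an inference from the option descriptions above, not something reported by the GATK itself.

```python
def total_cpu_threads(nt: int = 1, nct: int = 1) -> int:
    """Threads requested when -nt and -nct are combined.

    Assumption: each of the `nt` data threads gets `nct` CPU threads,
    so the total is the product of the two values.
    """
    if nt < 1 or nct < 1:
        raise ValueError("-nt and -nct must both be at least 1")
    return nt * nct


# Two different splits that request the same number of threads overall.
print(total_cpu_threads(nt=8, nct=3))
print(total_cpu_threads(nt=4, nct=6))
```

Comparing the result with the number of cores on the machine is a quick sanity check before launching a job.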

Memory considerations for multi-threading

Each data thread needs to be given the full amount of memory you’d normally give a single run. So if a tool normally requires 2 GB of memory, running it with -nt 4 will use 8 GB. In contrast, CPU threads share the memory allocated to their “mother” data thread, so you don’t need to adjust the memory allocation based on the number of CPU threads you use.
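The rule above can be sketched as a small Python helper. The function and parameter names are hypothetical; the only logic it encodes is what the paragraph states, namely that memory scales with the number of data threads and not with the number of CPU threads.

```python
def total_memory_gb(per_run_gb, data_threads=1, cpu_threads=1):
    """Estimate the memory needed for a multithreaded run, in GB.

    Each data thread (-nt) needs the full single-run allocation.
    CPU threads (-nct) share their data thread's memory, so
    `cpu_threads` is validated but does not change the result.
    """
    if data_threads < 1 or cpu_threads < 1:
        raise ValueError("thread counts must be at least 1")
    return per_run_gb * data_threads


# The example from the text: a 2 GB tool run with -nt 4.
print(total_memory_gb(2, data_threads=4))
# Adding CPU threads leaves the estimate unchanged.
print(total_memory_gb(2, data_threads=4, cpu_threads=8))
```

Both calls give 8, matching the 8 GB figure in the example above.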

Additional consideration when using -nct with versions 2.2 and 2.3

Because of the way the -nct option was originally implemented, versions 2.2 and 2.3 reserve one CPU thread to “manage” the rest. So if you use -nct, you’ll only start seeing a speedup at -nct 3 (which yields two effective "working" threads) and above. This limitation is resolved in the implementation available in versions 2.4 and up.
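As a rough illustration of this version-dependent behavior, the sketch below computes the number of effective working threads. The function name and the (major, minor) version tuple are hypothetical conventions. It also assumes that -nct 1 means a plain single-threaded run with nothing reserved, which the text does not state explicitly.

```python
def effective_working_threads(nct, gatk_version):
    """Effective working CPU threads for a given -nct value.

    gatk_version is a (major, minor) tuple, e.g. (2, 3).
    In versions 2.2 and 2.3 one CPU thread is reserved to manage the
    others; from 2.4 on, all requested threads do work.
    Assumption: with nct == 1 no manager thread is reserved.
    """
    if nct < 1:
        raise ValueError("-nct must be at least 1")
    reserves_manager = (2, 2) <= tuple(gatk_version[:2]) <= (2, 3)
    if reserves_manager and nct > 1:
        return nct - 1
    return nct


print(effective_working_threads(3, (2, 3)))  # two working threads, as in the text
print(effective_working_threads(3, (2, 4)))  # all three threads do work
```

Under these assumptions, -nct 2 on version 2.2 or 2.3 yields a single working thread, which is why no speedup appears before -nct 3.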

Scatter-gather

For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

Applicability of parallelism to the major GATK tools

Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.

Tool  Full name               Type of traversal  NT  NCT  SG
RTC   RealignerTargetCreator  RodWalker          +   -    -
IR    IndelRealigner          ReadWalker         -   -    +
BR    BaseRecalibrator        LocusWalker        -   +    +
PR    PrintReads              ReadWalker         -   +    -
RR    ReduceReads             ReadWalker         -   -    +
UG    UnifiedGenotyper        LocusWalker        +   +    +
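To check a planned run against this table before launching it, the table can be encoded as a lookup. This is only a sketch: the dictionary is copied from the table above, and the function name and arguments are hypothetical and not part of the GATK.

```python
# Parallelization modes per tool, copied from the table above.
SUPPORTED_MODES = {
    "RealignerTargetCreator": {"NT"},
    "IndelRealigner": {"SG"},
    "BaseRecalibrator": {"NCT", "SG"},
    "PrintReads": {"NCT"},
    "ReduceReads": {"SG"},
    "UnifiedGenotyper": {"NT", "NCT", "SG"},
}


def unsupported_modes(tool, nt=1, nct=1, scatter_gather=False):
    """Return the requested modes that the table marks as unsupported."""
    supported = SUPPORTED_MODES[tool]
    requested = []
    if nt > 1:
        requested.append("NT")
    if nct > 1:
        requested.append("NCT")
    if scatter_gather:
        requested.append("SG")
    return [mode for mode in requested if mode not in supported]


print(unsupported_modes("PrintReads", nt=4))
print(unsupported_modes("UnifiedGenotyper", nt=8, nct=3, scatter_gather=True))
```

The first call returns ['NT'], since PrintReads only supports -nct; the second returns an empty list. Tools not listed in the table raise a KeyError, so their technical documentation still needs to be checked.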

Recommended configurations

The table below summarizes configurations that we typically use for our own projects (one per tool, except we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and that we cannot give you any guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you.

Tool                 RTC  IR  BR      PR   RR  UG
Available modes      NT   SG  NCT,SG  NCT  SG  NT,NCT,SG
Cluster nodes        1    4   4       1    4   4 / 4 / 4
CPU threads (-nct)   1    1   8       4-8  1   3 / 6 / 24
Data threads (-nt)   24   1   1       1    1   8 / 4 / 1
Memory (GB)          48   4   4       4    4   32 / 16 / 4

Where NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue. For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.
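As a starting point for experimenting, a few of these configurations can be turned into command-line fragments. The sketch below is hedged in several ways: the helper and dictionary names are hypothetical, only the first UnifiedGenotyper variant is included, and it assumes the table's memory figure is passed as the Java heap size (-Xmx), which the table itself does not say. Input and output arguments such as -R, -I and -o are left out.

```python
# Values copied from the table above (first of the three UG variants).
RECOMMENDED = {
    "RealignerTargetCreator": {"nt": 24, "nct": 1, "memory_gb": 48},
    "BaseRecalibrator": {"nt": 1, "nct": 8, "memory_gb": 4},
    "UnifiedGenotyper": {"nt": 8, "nct": 3, "memory_gb": 32},
}


def build_command(tool, jar="GenomeAnalysisTK.jar"):
    """Build the threading and memory part of a GATK command line."""
    cfg = RECOMMENDED[tool]
    # Assumption: the memory column maps to the Java heap size (-Xmx).
    cmd = ["java", f"-Xmx{cfg['memory_gb']}g", "-jar", jar, "-T", tool]
    # Only pass a threading option when it differs from the default of 1.
    if cfg["nt"] > 1:
        cmd += ["-nt", str(cfg["nt"])]
    if cfg["nct"] > 1:
        cmd += ["-nct", str(cfg["nct"])]
    return cmd


print(" ".join(build_command("UnifiedGenotyper")))
```

For UnifiedGenotyper this prints java -Xmx32g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -nt 8 -nct 3, to which the reference, input and output arguments would still need to be added.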

Geraldine Van der Auwera, PhD

Comments

  • xpastor (Posts: 3; Member)
    edited February 2013

    Hi,

    I have access to a cluster with 8 nodes, each node with 64 Gb RAM and 8 cores. I'm trying to process 30 samples using UnifiedGenotyper. Each sample consists of an exome of around 62e6 bases and has an average coverage around 60x. Can you give me any advice on nt/nct configuration in order to optimize the performance of the execution, using and not using Queue?

    Thanks, Xavier

  • Geraldine_VdAuwera (Posts: 5,202; Administrator, GSA Member)

    Hi Xavier,

    We don't have the resources right now to give case-by-case advice on configurations, so you'll need to experiment with your setup based on the general guidelines in the article. You may want to discuss it with the people who manage your cluster, as they may also have some helpful insights. Good luck!

  • trgall (Posts: 13; Member)

    On the DepthOfCoverage page http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html it says DepthOfCoverage supports -nt, but when run the walker says it does not support parallel execution. Is there a quick way to find out (before running) what tools support each option?

    Also, some of the nodes I use have hyperthreading. Should I double the -nt (or -nct), or just use the physical number of cores?

  • Geraldine_VdAuwera (Posts: 5,202; Administrator, GSA Member)

    The tech doc is the best way to find out whether a tool supports parallelism or not. If it's listed as TreeReducible, DoC should support -nt. Can you please post the command line you tried and the error message that you got?

  • ecyeh (Posts: 8; Member)

    Hi, we cannot run DepthOfCoverage with -nt either; we have tried both GATK 2.3-4 and 2.5-2. The command line is:

    java -Xmx2g -jar ./GenomeAnalysisTK.jar -R /repo/ref/ref.fasta -T DepthOfCoverage -o coverage_out_nt4 -I ./data/Sample1.bam -nt 2

    And we got this error message:

    ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis DepthOfCoverage aggregates results by interval. Due to a current limitation of the GATK, analyses of this type do not currently support parallel execution. Please run your analysis without the -nt option.
  • Geraldine_VdAuwera (Posts: 5,202; Administrator, GSA Member)
  • rsArgMar12 (Posts: 1; Member)

    Hi,

    I am working on the parallelization of a whole-exome pipeline. I split a BAM file by chromosome and perform recalibration and realignment separately on the 24 resulting BAM files. However, I have noticed that I get slightly different variant calls compared to the BAM file produced by a pipeline without parallelization. The difference appears at the realignment step, where the regions look differently aligned. I am wondering whether this is because RealignerTargetCreator and IndelRealigner are not properly supported with the scatter-gather technique (we are not using Queue to perform scatter-gather). Thank you.

  • Geraldine_VdAuwera (Posts: 5,202; Administrator, GSA Member)

    Hi rsArgMar12,

    Can you please post a couple of screenshots showing the differences you see?

  • Tristan (La Jolla, CA; Posts: 11; Member)

    Hello there! Was wondering if you could update this wonderful post to include recommendations for the HaplotypeCaller?

  • armen (Posts: 18; Member)

    Hi,

    Does the above recommendation of 4 cluster nodes assume that there are exactly 4 nodes available, so that if more nodes were available they could also be used? Or does it mean that even if more than 4 nodes are available, using more than 4 would actually slow down the process (due to I/O issues, maybe)?

  • Geraldine_VdAuwera (Posts: 5,202; Administrator, GSA Member)

    Hi @Tristan,

    Sorry to get back to you so late, your comment seems to have slipped through my net. We're working on a set of new docs for HC, so I'll include an update for this doc as well. Can't promise an ETA though, might be a week or two before we get to it.

  • Geraldine_VdAuwera (Posts: 5,202; Administrator, GSA Member)

    Hi @armen,

    Your first hypothesis is correct -- the example assumes 4 nodes. If you have more, feel free to use more. In our hands it takes a lot more before we see any I/O issues. But that can depend on your platform, so you may want to experiment a little before launching any important jobs.

  • alirezakj (Posts: 48; Member)
    edited October 2013

    I have a 6-core i7 CPU that, with hyper-threading, gives me 12 logical cores, and I have 64 GB of DDR3 RAM. I want to run UG as fast as possible on this system. What would be the maximum -nt and -nct values for it? Would this command make sense?

    java -jar GenomeAnalysisTK.jar \
        -R resources/Homo_sapiens_assembly18.fasta \
        -T UnifiedGenotyper \
        -I sample1.bam [-I sample2.bam ...] \
        --dbsnp dbSNP.vcf \
        -o snps.raw.vcf \
        -nt 12 \
        -nct 6

    Thanks

  • Geraldine_VdAuwera (Posts: 5,202; Administrator, GSA Member)

    Hi @alirezakj,

    We can't provide specific recommendations for determining what multithreading values make sense for individual systems; you'll need to experiment and figure it out on your own, sorry.

  • alirezakj (Posts: 48; Member)

    Thank you Geraldine for the response. I totally understand why you don't provide any recommendations. As far as I know, people who use close to 100% of their CPU might face some issues (e.g., high CPU temperature, crashes in the middle of their jobs, slowdown of the system if they need to run something else, and so on).

    I figured out a better way to adjust my variant calling in an efficient way for my system and for ME! For those of you who are interested in adjusting your -nt and -nct according to your system in a faster way rather than trying to run different commands with different -nt values and comparing the time! I have a recommendation.

    If you install a tool like sensors (command line) or psensor (GUI), you can monitor your CPU temperature and CPU usage in real time. I have 12 logical cores (hyper-threaded 6-core i7 at 3.9 GHz, and 64 GB RAM), and if I run UG with -nt 6 I get around 50% CPU usage and a temperature of 55 degrees Celsius on each of my 6 physical cores. (I have a good cooling system, so even if my CPU usage is about 100% I still don't have an overheating problem.)
    However, I personally don't like running a single command that uses about 85% of my CPU, so -nt 6 is perfect for me, as it uses 50% and I don't get any overheating! (By the way, for those who have speed fever, I don't recommend overclocking your CPU for a bioinformatics job that might run for more than 5 hours.)

    Good luck with adjusting your -nt value :)

  • Geraldine_VdAuwera (Posts: 5,202; Administrator, GSA Member)

    Thanks for reporting this, @alirezakj -- I'm sure it will be useful for others.

  • alirezakj (Posts: 48; Member)

    Thank YOU Geraldine for all of your efforts! By the way, I forgot to report that UG without an -nt value (the default) uses 9% of my CPU at a temperature of 37 C! So -nt certainly helps a lot, thanks to the GATK development team! Those who are very interested in speed, please keep in mind that a slow HDD is a speed bottleneck in itself, and -nt might not help much beyond a certain value! I have 20 TB of HDD space in RAID 0 (5 x 4 TB Western Digital Black, each 7200 RPM, SATA 6.0 Gb/s, 64 MB cache), which makes things very fast in RAID mode!

  • Tristan (La Jolla, CA; Posts: 11; Member)

    Howdy GATK folk! @Geraldine_VdAuwera - Thought you and the crowd here might benefit: we've found that PrintReads seems to stop scaling after about -nt 8 and doesn't seem to need much RAM for those 8; a Java heap of around 8 GB (-Xmx8g) seems to be enough. Might just be our fun SSD-heavy HPC system though (speaking of disk speed).

  • Geraldine_VdAuwera (Posts: 5,202; Administrator, GSA Member)

    Thanks for reporting that, @Tristan! Sounds about right for PrintReads -- not a lot of processing going on there, the burden is mostly all I/O, unless you're recalibrating bases, which does take a little more for the calculations (but nothing crazy).
