The current GATK version is 3.7-0

Per-sample parallelism

armenarmen Member Posts: 18

Queue allows for "per-region" parallelism using scatter-gather. However, not all GATK tools support this (e.g. RealignerTargetCreator), and not all tools in a pipeline are GATK tools (e.g. BWA).

What I would like to do in the first phase of the Best Practices pipeline is "per-sample" parallelism, that is, process each sample in parallel on a separate cluster node. Is there a recommended way to do this?
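To make the idea concrete, here is a minimal sketch of what "one job per sample" could look like in Python. It is not a Queue feature or a recommended GATK method. The script name `process_sample.sh` and the `launcher` prefix are hypothetical placeholders: on a real cluster the launcher would be whatever command your scheduler uses to send a job to a node.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(sample, script="process_sample.sh", launcher=("bash",)):
    """Build the argv for one sample's job.

    `script` and `launcher` are hypothetical placeholders; swap the
    launcher for your scheduler's submit command.
    """
    return [*launcher, script, sample]


def run_samples(samples, max_parallel=4, runner=subprocess.run, **build_kwargs):
    """Run one job per sample, at most `max_parallel` at a time.

    Returns a dict mapping each sample name to its exit code.
    """
    def run_one(sample):
        result = runner(build_command(sample, **build_kwargs))
        return sample, result.returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(run_one, samples))
```

Calling `run_samples(["sampleA", "sampleB"])` would start both jobs concurrently and collect their exit codes. Threads are enough here because each one only waits on an external process.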


Best Answer


  • armenarmen Member Posts: 18

    Running a script on multiple cluster nodes at once would lead to an I/O bottleneck, because all of the nodes would concurrently access the network location where the data are stored.

    To overcome this, the input files could be copied to the node's local storage before processing, and the output files copied back to the original location afterwards. However, synchronization is still needed to ensure that only one transfer is done at a time.

    I was wondering what strategies people use to tackle these I/O issues, and if there is any tool to do this automatically without resorting to writing custom synchronization code.
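For reference, the stage-in, process, stage-out pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an existing tool: it assumes a Unix system and a lock file on a filesystem where `fcntl.flock` locks are honoured across nodes, which is not true of every network filesystem setup. The function names and the `process` callback are hypothetical.

```python
import fcntl
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def transfer_lock(lock_path):
    """Hold an exclusive advisory lock so only one transfer runs at a time."""
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_staged(inputs, scratch_dir, output_dir, lock_path, process):
    """Copy inputs to local scratch, process them there, copy results back.

    `process(local_inputs, scratch)` is a caller-supplied (hypothetical)
    function that returns the list of output files it wrote in scratch.
    """
    scratch = Path(scratch_dir)
    scratch.mkdir(parents=True, exist_ok=True)

    # Stage in: serialized, so nodes do not hit shared storage together.
    with transfer_lock(lock_path):
        local_inputs = [Path(shutil.copy2(src, scratch)) for src in inputs]

    # Processing uses only node-local files, so it needs no lock.
    local_outputs = process(local_inputs, scratch)

    # Stage out: serialized again.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    with transfer_lock(lock_path):
        return [Path(shutil.copy2(out, output_dir)) for out in local_outputs]
```

The lock is held only during the copies, so the processing steps on different nodes still overlap. Allowing a small number of simultaneous transfers instead of exactly one would need a different primitive, such as a counting semaphore provided by the scheduler.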
