Per-sample parallelism

armenarmen Posts: 18 Member

Queue allows for "per-region" parallelism using scatter-gather. However, not all GATK tools support this (e.g. RealignerTargetCreator), and not all tools in a pipeline are GATK tools (e.g. BWA).

What I would like to do in the first phase of the best-practice pipeline is "per-sample" parallelism, that is, to process each sample in parallel on a separate cluster node. Is there a recommended way to do this?
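To make the question concrete, here is a minimal sketch of the per-sample idea in Python. It is not a Queue feature or a recommended GATK method. The names `run_per_sample` and `build_command` are hypothetical, and the sketch assumes that each sample can be processed by one independent command. On a cluster, that command would typically wrap the scheduler's own submit command, which is not shown here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_per_sample(samples, build_command, max_parallel=4):
    """Run one command per sample concurrently.

    build_command is a hypothetical callable that maps a sample name to
    an argument list, e.g. a wrapper script for that sample.
    Returns a dict mapping each sample name to its exit code.
    """
    def run_one(sample):
        # Each sample gets its own independent process.
        completed = subprocess.run(build_command(sample))
        return sample, completed.returncode

    # Threads only wait on the child processes, so a thread pool is enough.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(run_one, samples))
```

A non-zero exit code in the returned dict marks a sample whose command failed, so failed samples can be rerun without repeating the others.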


Best Answer


  • armenarmen Posts: 18 Member

    Running a script on multiple cluster nodes would lead to an I/O bottleneck, because the nodes would all access the network location where the data are stored at the same time.

    To overcome this, the input files could be copied to the node before processing, and the output files copied back to the original location afterwards. However, synchronization is still needed to ensure that only one transfer runs at a time.

    I was wondering what strategies people use to tackle these I/O issues, and if there is any tool to do this automatically without resorting to writing custom synchronization code.
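The stage-in / stage-out approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing tool. The names `transfer_lock` and `process_staged` are hypothetical, and the lock is a directory created on shared storage, relying on `os.mkdir` failing when the directory already exists. Whether that is reliable on a particular network file system would need to be verified, and a job that crashes while holding the lock leaves it behind.

```python
import os
import shutil
import time
from contextlib import contextmanager


@contextmanager
def transfer_lock(lock_dir, poll_seconds=1.0):
    """Hold an exclusive lock by creating lock_dir; wait while it exists."""
    while True:
        try:
            os.mkdir(lock_dir)  # fails if another job holds the lock
            break
        except FileExistsError:
            time.sleep(poll_seconds)
    try:
        yield
    finally:
        os.rmdir(lock_dir)  # release the lock, even if the copy failed


def process_staged(src, local_dir, dest_dir, lock_dir, process):
    """Copy src to node-local storage, process it there, copy the result back.

    process is a hypothetical callable that takes the local input path
    and returns the path of the local output file it wrote.
    """
    os.makedirs(local_dir, exist_ok=True)
    local_in = os.path.join(local_dir, os.path.basename(src))
    with transfer_lock(lock_dir):  # only one stage-in at a time
        shutil.copy2(src, local_in)
    local_out = process(local_in)  # runs without touching shared storage
    with transfer_lock(lock_dir):  # only one stage-out at a time
        return shutil.copy2(local_out, dest_dir)
```

Only the copies are serialized here. The processing step runs outside the lock, so the nodes still work in parallel once their inputs are local.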
