What are the smallest units I can break whole human genomes into, for scatter-gather?
Hi, and thank you so much for the wonderful tools and support!
For our current project, we'd like to run 2000+ whole genomes from FASTQ to VCF using GATK best practices.
I'd like to optimize the runtime, in particular of GenotypeGVCFs.
Previously, we have used -nt with GenotypeGVCFs for parallelism.
With GATK 3.7, using threads (-nt) with GenotypeGVCFs has always crashed due to what appears to be lack of thread safety.
From what I've read on this forum, this is a known issue, and users are urged to use a scatter-gather approach.
If I understand correctly, scatter-gather for GenotypeGVCFs would entail splitting the combined, multi-sample, whole-genome gVCFs into, say, combined, multi-sample, per-chromosome gVCFs, and then running GenotypeGVCFs on each multi-sample, per-chromosome gVCF on a cluster in single-threaded mode.
Please correct me if this understanding is not accurate. I have read the Parallelism and Scatter-Gather pages on the forums.
If my understanding of scatter-gather is accurate, then it seems that to get the best performance when scaling out, you would want to subset the multi-sample, whole-genome gVCFs into as-small-as-reasonably-possible gVCFs, so that you could run hundreds or thousands of them in parallel on the cluster.
For example, partition the multi-sample, whole-genome gVCFs into, say, 10 kb regions over each chromosome, yielding ~300,000 multi-sample gVCFs. Then submit those to your batch/queue system and run each as its own invocation of GenotypeGVCFs with -nt 1.
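To make the idea concrete, here is a minimal Python sketch of the partitioning I have in mind. It is only an illustration under some assumptions: the contig names and lengths are toy values, the file names (`GenomeAnalysisTK.jar`, `ref.fasta`, `combined.g.vcf`) are hypothetical, and it assumes each region is passed to GenotypeGVCFs as an interval with -L rather than by physically splitting the gVCF, which may or may not be the recommended way to do it.

```python
def make_intervals(contig_lengths, window=10_000):
    """Yield 'contig:start-end' strings (1-based, inclusive) tiling each contig."""
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, window):
            end = min(start + window - 1, length)  # last window may be shorter
            yield f"{contig}:{start}-{end}"


def genotype_command(interval, reference="ref.fasta", gvcf="combined.g.vcf"):
    """Build one single-threaded GATK 3 GenotypeGVCFs command for one interval.

    The jar, reference, and gVCF names are placeholders, not real paths.
    """
    out = interval.replace(":", "_").replace("-", "_") + ".vcf"
    return [
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-T", "GenotypeGVCFs",
        "-R", reference,
        "-V", gvcf,
        "-L", interval,
        "-nt", "1",
        "-o", out,
    ]


# Toy contig lengths, purely illustrative (not real chromosome sizes).
toy_contigs = {"chrA": 25_000, "chrB": 10_000}
intervals = list(make_intervals(toy_contigs))

for interval in intervals:
    # In practice each command would be submitted to the batch/queue system.
    print(" ".join(genotype_command(interval)))
```

With real chromosome lengths in place of the toy values, the same loop would produce on the order of the ~300,000 jobs mentioned above, and the per-region VCFs would then need to be gathered back into one file in genomic order.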
However, the recommendations on this forum, for whole-genome data, tend to be to split at the chromosome level.
This would limit your parallelism to 22 jobs if you were running only the human autosomes.
And if there are a large number of high-coverage samples, and you're forced into single-threaded mode, this will not be efficient.
So, what is the smallest unit into which one can break up whole-genome data for GenotypeGVCFs?