What are the smallest units I can break whole human genomes into, for scatter-gather?
Hi, and thank you so much for the wonderful tools and support!
For our current project, we'd like to run 2000+ whole genomes from FASTQ to VCF using GATK best practices.
I'd like to optimize the runtime, in particular of GenotypeGVCFs.
Previously, we have used -nt with GenotypeGVCFs for parallelism.
With GATK 3.7, using threads (-nt) with GenotypeGVCFs has always crashed due to what appears to be lack of thread safety.
From what I've read on this forum, this is a known issue, and users are urged to use a scatter-gather approach.
If I understand correctly, scatter-gather for GenotypeGVCFs would entail splitting the combined, multi-sample, Whole-Genome gVCFs into, say, combined, multi-sample, Per-Chromosome gVCFs. And then executing GenotypeGVCFs on each multi-sample, chromosomal gVCF, on a cluster, in single-threaded mode?
Please correct me if this understanding is not accurate. I have read the Parallelism and Scatter-Gather pages on the forums.
If my understanding of scatter-gather is accurate, then it seems that to get the best performance when scaling out, you would want to subset the multi-sample, whole-genome gVCFs into as-small-as-reasonably-possible gVCFs, so that you could run hundreds or thousands of them in parallel on the cluster.
E.g. Partition the multi-sample, whole-genome gVCFs into, say, 10kb regions over each chromosome, yielding ~300,000 multi-sample gVCFs. Then you submit those to your batch/queue system and run each as its own invocation of GenotypeGVCFs with -nt 1.
However, the recommendations on this forum, for whole-genome data, tend to be to split at the chromosome level.
This would limit your parallelism to 22 if your were running the human autosomes.
And if there are a large number of high-coverage samples, and you're forced into single thread mode, this will not be efficient.
So, what is the smallest unit one can break-up whole genome data for GenotypeGVCFs?