What are the smallest units I can break whole human genomes into, for scatter-gather?
Hi, and thank you so much for the wonderful tools and support!
For our current project, we'd like to run 2000+ whole genomes from FASTQ to VCF using GATK best practices.
I'd like to optimize the runtime, in particular of GenotypeGVCFs.
Previously, we have used -nt with GenotypeGVCFs for parallelism.
With GATK 3.7, however, running GenotypeGVCFs with threads (-nt) has crashed every time, apparently because of a lack of thread safety.
From what I've read on this forum, this is a known issue, and users are urged to use a scatter-gather approach.
If I understand correctly, scatter-gather for GenotypeGVCFs would entail splitting the combined, multi-sample, whole-genome gVCFs into combined, multi-sample, per-chromosome gVCFs, and then running GenotypeGVCFs on each per-chromosome gVCF as a separate single-threaded job on a cluster?
Please correct me if this understanding is not accurate. I have read the Parallelism and Scatter-Gather pages on the forums.
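As a rough sketch of the per-chromosome scatter described above, the snippet below builds one single-threaded GATK 3 GenotypeGVCFs command per autosome. The file names (`GenomeAnalysisTK.jar`, `reference.fasta`, `combined.g.vcf`), the output naming scheme, and the contig names `1` to `22` are placeholders assumed for illustration; a reference that uses `chr1`-style names would need different intervals. Job submission and the gather step are not shown.

```python
from typing import List


def genotype_command(
    interval: str,
    reference: str = "reference.fasta",   # hypothetical path
    gvcf: str = "combined.g.vcf",         # hypothetical path
    jar: str = "GenomeAnalysisTK.jar",    # hypothetical path
) -> List[str]:
    """Build a single-threaded GenotypeGVCFs command restricted to one interval."""
    return [
        "java", "-jar", jar,
        "-T", "GenotypeGVCFs",
        "-R", reference,
        "-V", gvcf,
        "-L", interval,          # restrict this job to one chromosome
        "-nt", "1",              # single-threaded, as in the post
        "-o", f"genotyped.{interval}.vcf",  # assumed output naming
    ]


# One job per human autosome: 22 commands in total.
autosomes = [str(n) for n in range(1, 23)]
commands = [genotype_command(contig) for contig in autosomes]

for command in commands[:2]:
    print(" ".join(command))
```

Each command would then be handed to the batch system as its own job, which is where the ceiling of 22 parallel jobs comes from.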
If my understanding of scatter-gather is accurate, then it seems that to get the best performance when scaling out, you would want to subset the multi-sample, whole-genome gVCFs into as-small-as-reasonably-possible gVCFs, so that you could run hundreds or thousands of them in parallel on the cluster.
For example, partition the multi-sample, whole-genome gVCFs into 10 kb regions along each chromosome, yielding roughly 300,000 multi-sample gVCFs (about 3 Gb of genome divided by 10 kb). Then submit those to the batch/queue system and run each one as its own invocation of GenotypeGVCFs with -nt 1.
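To make the 10 kb partitioning concrete, here is a minimal sketch that tiles contigs into fixed-size windows and formats them as `contig:start-end` interval strings (1-based, inclusive), the form GATK accepts with `-L`. The function names and the toy contig lengths are assumptions for illustration; real lengths would come from the reference's sequence dictionary or `.fai` index.

```python
from typing import Dict, Iterator, List, Tuple


def tile_contig(
    name: str, length: int, window: int = 10_000
) -> Iterator[Tuple[str, int, int]]:
    """Yield 1-based, inclusive (contig, start, end) windows covering a contig."""
    if window <= 0:
        raise ValueError("window must be a positive number of bases")
    for start in range(1, length + 1, window):
        # The last window is clipped to the end of the contig.
        yield name, start, min(start + window - 1, length)


def tile_genome(contig_lengths: Dict[str, int], window: int = 10_000) -> List[str]:
    """Return interval strings for every window on every contig."""
    return [
        f"{contig}:{start}-{end}"
        for name, length in contig_lengths.items()
        for contig, start, end in tile_contig(name, length, window)
    ]


# Toy contig lengths, not real chromosome sizes.
toy_lengths = {"contigA": 25_000, "contigB": 10_000}
intervals = tile_genome(toy_lengths)
print(intervals)

# Rough job count for a 3 Gb genome at 10 kb per window (ceiling division).
approx_jobs = -(-3_000_000_000 // 10_000)
print(approx_jobs)
```

Each interval string would become one scatter job; the window size is the knob that trades per-job overhead against parallelism.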
However, the recommendations on this forum, for whole-genome data, tend to be to split at the chromosome level.
This would limit parallelism to 22 jobs if you were running only the human autosomes.
With a large number of high-coverage samples, and each job forced into single-threaded mode, that will not be efficient.
So, what is the smallest unit into which whole-genome data can be split for GenotypeGVCFs?