Detailed documentation of how GATK tools employ Spark
It's been a while since GATK4 came out and the Spark tools were introduced (yay!), but so far I haven't been able to find good documentation on how exactly GATK uses Spark.
It would be great if you could fill these pages with some content (single machine multi-core, Spark cluster). In particular, I'm interested in how the jobs are managed. For instance, when running locally with
`local[*]`, how does HaplotypeCaller traverse the data? Does the active-region traversal still apply for the Spark tools? What about the concept of Walkers? How many blocks of data does each Spark RDD contain? Have you done any tests to improve performance, or do you mostly rely on the default Spark settings to manage parallelism?
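For reference, this is the kind of invocation I mean by "running locally" (file names are placeholders; the Spark arguments after the `--` separator follow the GATK4 Spark-tool convention, so please correct me if the recommended form differs):

```shell
# Placeholder inputs; adjust paths to your own data.
gatk HaplotypeCallerSpark \
    -R reference.fasta \
    -I sample.bam \
    -O sample.vcf.gz \
    -- \
    --spark-runner LOCAL \
    --spark-master 'local[4]'   # run Spark in-process on 4 local cores
```

What I'd like to understand is what happens behind a command like this: how the BAM is partitioned into RDDs, and whether tuning `--spark-master` (or other Spark properties) is expected to matter much compared to the defaults.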