Workflow for GenomicsDBImport and GenotypeGVCFs across multiple intervals
We have ~250 WGS samples that I am trying to run through the GATK Best Practices pipeline for germline variant calling. I'm currently running GATK v188.8.131.52, although the GVCFs were previously generated using HaplotypeCaller v3.7.
I'm trying to set up a pipeline for GenomicsDBImport and GenotypeGVCFs, because CombineGVCFs under GATK 3.7 was simply not able to process this many WGS samples. I have already read a couple of posts on this topic (here and here) but am still confused on a few points in putting the pipeline together.
First, I see that the recommendation for intervals is to use the same number of intervals as the number of samples, which would be 250 in my case. Are these splits across the whole genome or per chromosome? That is, should I split each chromosome into 250 intervals, or split the entire genome into 250 intervals (~10 per chromosome)? Also, should any of the intervals overlap each other to cover variants that may fall right at an interval boundary?
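To make the "split the entire genome" option concrete, here is a minimal Python sketch of what that could look like. It cuts contigs into roughly equal, non-overlapping, 1-based inclusive intervals that never span a contig boundary. The function names and the toy contig lengths are made up for illustration, and the sketch does not settle whether overlapping intervals are needed.

```python
import math


def split_genome(contig_lengths, n_intervals):
    """Split contigs into about n_intervals non-overlapping intervals.

    contig_lengths: dict mapping contig name -> length in bases.
    Returns a list of (contig, start, end) tuples, 1-based and inclusive.
    The count can exceed n_intervals slightly, because every contig
    ends with its own (possibly shorter) final interval.
    """
    total = sum(contig_lengths.values())
    chunk = math.ceil(total / n_intervals)  # target interval size in bases
    intervals = []
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(start + chunk - 1, length)
            intervals.append((contig, start, end))
            start = end + 1  # next interval begins right after this one
    return intervals


def to_interval_string(interval):
    """Format an interval as contig:start-end, the form accepted by -L."""
    contig, start, end = interval
    return f"{contig}:{start}-{end}"


# Toy contig lengths for illustration only, not a real reference genome.
toy_contigs = {"chr1": 1000, "chr2": 600, "chr3": 400}
toy_intervals = split_genome(toy_contigs, 10)
interval_strings = [to_interval_string(i) for i in toy_intervals]
```

With a real reference, the contig lengths would come from the sequence dictionary (`.dict`) or the `.fai` index rather than a hand-written dict.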
Second, from what I understand, both GenomicsDBImport and GenotypeGVCFs should be run separately on each interval, creating 250 separate databases and VCF files (or 250 × 24 if each chromosome is split separately). For the downstream steps in the Best Practices pipeline (e.g. VariantRecalibrator), would a simple merge of the 250 VCF files into one VCF file after GenotypeGVCFs suffice, or are there any other special processing steps recommended?
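The per-interval workflow described above might be sketched as follows. This only builds the command lines and does not run them. The flag names (`--sample-name-map`, `--genomicsdb-workspace-path`, `gendb://`, `MergeVcfs -I/-O`) are as I understand them for GATK 4 and should be checked against the installed version. All file and directory names are hypothetical placeholders. Whether a plain merge is sufficient before VariantRecalibrator is exactly the open question, so treat the last step as an assumption.

```python
def import_cmd(interval, sample_map, workspace):
    """GenomicsDBImport for one interval into its own workspace."""
    return [
        "gatk", "GenomicsDBImport",
        "--sample-name-map", sample_map,          # tab-separated: sample<TAB>gvcf path
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
    ]


def genotype_cmd(reference, workspace, interval, out_vcf):
    """GenotypeGVCFs reading from the matching GenomicsDB workspace."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace}",
        "-L", interval,
        "-O", out_vcf,
    ]


def merge_cmd(vcfs, out_vcf):
    """Merge the per-interval VCFs into one file (assumed final step)."""
    cmd = ["gatk", "MergeVcfs"]
    for vcf in vcfs:
        cmd += ["-I", vcf]
    return cmd + ["-O", out_vcf]


# Hypothetical inputs, for illustration only.
intervals = ["chr1:1-200", "chr1:201-400"]
jobs = []
shard_vcfs = []
for idx, interval in enumerate(intervals):
    workspace = f"genomicsdb/shard_{idx:04d}"
    out_vcf = f"vcf/shard_{idx:04d}.vcf.gz"
    jobs.append(import_cmd(interval, "samples.map", workspace))
    jobs.append(genotype_cmd("ref.fasta", workspace, interval, out_vcf))
    shard_vcfs.append(out_vcf)
final_merge = merge_cmd(shard_vcfs, "cohort.vcf.gz")
```

Each command list can be passed to `subprocess.run(cmd, check=True)` or written out as one line of a scheduler job script.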
Lastly, our cluster has 1.25 TB of RAM and 160 cores. Do you have any rough estimates of how much memory to allocate per interval run (which would also determine how many intervals we can run at once)? I can try out a few different values here if you don't have any particular recommendations.
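The trade-off between memory per run and the number of simultaneous runs is simple arithmetic, sketched below. The per-job memory values are arbitrary placeholders for experimentation, not recommendations, and 1.25 TB is taken as 1250 GB. The function name and the one-core-per-job default are assumptions.

```python
def max_concurrent_jobs(total_mem_gb, total_cores, mem_per_job_gb, cores_per_job=1):
    """Number of jobs that fit at once, limited by memory or by cores."""
    by_memory = total_mem_gb // mem_per_job_gb
    by_cores = total_cores // cores_per_job
    return min(by_memory, by_cores)


TOTAL_MEM_GB = 1250  # 1.25 TB, assuming decimal units
TOTAL_CORES = 160

# Placeholder per-job memory values to try, in GB.
plan = {mem: max_concurrent_jobs(TOTAL_MEM_GB, TOTAL_CORES, mem) for mem in (4, 8, 16, 32)}
```

Below roughly 7.8 GB per job (1250 / 160) the core count is the limit. Above that, memory is. In practice some headroom should be left for the operating system and for native (non-heap) memory used by each job.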
Thank you very much for your time and help!