Workflow for GenomicsDBImport and GenotypeGVCFs across multiple intervals
We have ~250 WGS samples that I am trying to run through the GATK Best Practices pipeline for germline calls. I'm currently running GATK v18.104.22.168, although the GVCFs were previously generated using HaplotypeCaller v.3.7.
I'm trying to set up a pipeline for GenomicsDBImport and GenotypeGVCFs, as CombineGVCFs under GATK 3.7 was simply not able to process the large number of WGS samples. I read through a couple of posts on this topic already here and here but am still confused on a couple of points in putting the pipeline together.
First, I see that the recommendation for using intervals is to use the same number of intervals as the number of samples--250 in my case. Are these splits for the whole genome or for individual chromosomes--i.e. split each chromosome into 250 intervals, or split the entire genome into 250 intervals (~10 per chromosome)? Also, should any of the intervals overlap each other to cover variants that may fall right at an interval boundary?
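To make the "split the entire genome" option concrete, here is a minimal Python sketch of how I would generate the intervals. It assumes non-overlapping, 1-based inclusive intervals that never cross a contig boundary. The function name `split_genome` and the toy contig names and lengths are made up for illustration; real lengths would come from the reference's .fai or .dict file.

```python
def split_genome(contig_lengths, n_intervals):
    """Split contigs into roughly n_intervals non-overlapping intervals.

    contig_lengths: list of (name, length) tuples in reference order.
    Returns strings like "chr1:1-1000" (1-based, inclusive).
    Intervals never span two contigs, so the final count can be
    slightly higher than n_intervals.
    """
    total = sum(length for _, length in contig_lengths)
    target = -(-total // n_intervals)  # ceiling division: bases per interval
    intervals = []
    for name, length in contig_lengths:
        start = 1
        while start <= length:
            end = min(start + target - 1, length)
            intervals.append(f"{name}:{start}-{end}")
            start = end + 1
    return intervals


# Toy contigs (hypothetical names and lengths, not a real genome).
toy_contigs = [("chrA", 1000), ("chrB", 600), ("chrC", 400)]
intervals = split_genome(toy_contigs, 10)
for interval in intervals:
    print(interval)
```

With this scheme each chromosome gets a share of the intervals proportional to its length, which is what I meant by "~10 per chromosome" on average.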
Second, from what I understand, both GenomicsDBImport and GenotypeGVCFs should be run separately on each interval, creating 250 (or 250*24 chromosomes) separate databases and VCF files. For the downstream steps in the Best Practices pipeline (i.e. VariantRecalibrator), would a simple merge of the 250 VCF files into one VCF file after GenotypeGVCFs suffice, or are there any other special processing steps recommended?
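For reference, this is roughly how I picture the per-interval runs and the final merge, written as a Python sketch that only builds the command lines and does not run anything. The flag names are the ones I understand from the GATK 4.x tool documentation and may differ between versions. The file names (`samples.map`, `ref.fasta`), the workspace and output naming scheme, and the helper functions are all hypothetical.

```python
import shlex


def per_interval_commands(interval, index, sample_map, reference):
    """Build GenomicsDBImport and GenotypeGVCFs commands for one interval."""
    workspace = f"genomicsdb_{index:04d}"      # hypothetical naming scheme
    out_vcf = f"genotyped_{index:04d}.vcf.gz"  # hypothetical naming scheme
    import_cmd = [
        "gatk", "GenomicsDBImport",
        "--sample-name-map", sample_map,
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
    ]
    genotype_cmd = [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace}",
        "-L", interval,
        "-O", out_vcf,
    ]
    return import_cmd, genotype_cmd, out_vcf


def gather_command(vcfs, merged):
    """Build one GatherVcfs command; inputs must be in genomic order."""
    cmd = ["gatk", "GatherVcfs"]
    for vcf in vcfs:
        cmd += ["-I", vcf]
    cmd += ["-O", merged]
    return cmd


# Example with two made-up intervals.
outputs = []
for i, interval in enumerate(["chrA:1-200", "chrA:201-400"]):
    import_cmd, genotype_cmd, out_vcf = per_interval_commands(
        interval, i, "samples.map", "ref.fasta"
    )
    print(shlex.join(import_cmd))
    print(shlex.join(genotype_cmd))
    outputs.append(out_vcf)
print(shlex.join(gather_command(outputs, "merged.vcf.gz")))
```

As far as I know, GatherVcfs expects the per-interval VCFs to be non-overlapping and listed in genomic order, which is part of why I am asking whether the intervals should overlap.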
Lastly, our cluster has 1.25 TB of RAM and 160 cores. Do you have any rough estimates on how much memory to allocate per interval run (which would also determine how many intervals we can run at once)? I can try out a few different values here if you don't have any particular recommendations.
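The trade-off I have in mind is simple arithmetic: the number of intervals we can run at once is capped by whichever runs out first, RAM or cores. Here is a small sketch of that calculation. The per-job memory values, the one-core-per-job default, and the 10% RAM headroom are placeholders I picked for illustration, not recommendations.

```python
def max_concurrent_jobs(total_ram_gb, total_cores, ram_per_job_gb,
                        cores_per_job=1, ram_headroom=0.1):
    """Return how many interval jobs fit at once on the cluster.

    ram_headroom is the fraction of RAM left unallocated for the OS
    and other overhead (an assumed value, adjust as needed).
    """
    usable_ram_gb = total_ram_gb * (1 - ram_headroom)
    by_ram = int(usable_ram_gb // ram_per_job_gb)
    by_cores = total_cores // cores_per_job
    return min(by_ram, by_cores)


# Our cluster: 1.25 TB RAM (taken as 1250 GB) and 160 cores.
for ram_per_job_gb in (4, 8, 16, 32):
    jobs = max_concurrent_jobs(1250, 160, ram_per_job_gb)
    print(f"{ram_per_job_gb} GB per job: {jobs} jobs at once")
```

With small per-job memory the 160 cores are the limit; with larger allocations RAM becomes the limit, so the right per-interval value matters a lot for throughput.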
Thank you very much for your time and help!