Best practices for joint genotyping a very large cohort (25k+ WGS samples)
We will soon be calling variants on 25k+ WGS samples, and we want to adopt the joint genotyping pipeline provided at https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4-local.wdl together with its inputs file https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4-local.hg38.wgs.inputs.json.
1. One concern is that this pipeline takes GVCF files as input, and a GVCF is much larger than a plain VCF because it records reference blocks as well as variant sites. I am not sure it is practical to handle GVCFs for 25k+ samples (see the first sketch after this list).
2. Another concern is memory usage: can we joint-genotype 25k+ WGS samples at once? (See the second sketch after this list.)
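For point 1, my understanding is that the pipeline does not merge everything into one monolithic GVCF; instead it consolidates the per-sample GVCFs into a GenomicsDB workspace, one genomic interval at a time. A rough sketch of that step (the workspace path, interval, memory setting, and `cohort.sample_map` file name are placeholders of mine):

```
# Consolidate per-sample GVCFs into a GenomicsDB workspace for one interval.
# cohort.sample_map is a tab-separated file of "sample_name<TAB>path/to/g.vcf.gz".
# --batch-size caps how many GVCF readers are open at once, bounding memory.
gatk --java-options "-Xmx8g" GenomicsDBImport \
    --genomicsdb-workspace-path /data/genomicsdb/chr20 \
    --sample-name-map cohort.sample_map \
    --batch-size 50 \
    -L chr20
```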
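For point 2, joint genotyping then reads from that workspace per interval, so memory scales with the interval being processed rather than with the whole cohort's raw GVCF data at once. Something like this, again with placeholder paths:

```
# Joint-genotype the whole cohort over the same interval, reading from GenomicsDB.
gatk --java-options "-Xmx8g" GenotypeGVCFs \
    -R Homo_sapiens_assembly38.fasta \
    -V gendb:///data/genomicsdb/chr20 \
    -O cohort.chr20.vcf.gz
```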
Given the above, I am wondering whether we could divide the 25k+ samples into smaller groups (e.g. 1,000 samples per group) and run joint genotyping group by group, without compromising variant-calling quality too much. By dividing, we should save space, memory, and time; a sketch of what I have in mind follows.
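To be concrete, this is the kind of grouping I have in mind, as a hypothetical shell sketch (file names, chunk size, and paths are all placeholders, and I have not benchmarked this):

```
# Split the cohort sample map into chunks of 1,000 samples, then import and
# joint-genotype each chunk independently. Purely illustrative.
split -l 1000 cohort.sample_map group_
for map in group_*; do
    gatk GenomicsDBImport \
        --genomicsdb-workspace-path "genomicsdb_${map}_chr20" \
        --sample-name-map "${map}" \
        --batch-size 50 \
        -L chr20
    gatk GenotypeGVCFs \
        -R Homo_sapiens_assembly38.fasta \
        -V "gendb://genomicsdb_${map}_chr20" \
        -O "${map}.chr20.vcf.gz"
done
```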
BTW, where can I find the example GVCFs such as "/home/bshifaw/data/joint_discovery/NA12878.g.vcf.gz" and "/home/bshifaw/data/joint_discovery/NA12878.g.vcf.gz.tbi"?