HaplotypeCaller strategy on a large cohort of Whole Genome samples
We have to perform HaplotypeCaller variant calling on a cohort of whole-genome samples (~400 samples). Since the region to be called is huge, I was wondering what the best way would be to do the sample-level gVCF calling and the steps thereafter:
- Should we split the whole genome into several smaller parts (for example, 100 BED parts) and then perform gVCF calling on each of those 100 parts for each of the 400 samples (100 BED parts * 400 samples = 40,000 gVCFs)?
- ...and then merge the per-sample gVCFs for each BED part into one joint VCF for that BED part?
- ...and then concatenate the 100 joint VCF parts into one final whole-genome VCF file?
Or is there a more efficient way to go about this?
I have not been able to find much information on your forums about people doing gVCF calling on smaller BED regions and then stitching those regions' gVCFs together into one giant gVCF or VCF. I am aware that there is a -L option available in HaplotypeCaller, but I am not sure what the recommended best practices are for using that option when it comes to gVCF calling.
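
To make the plan above concrete, here is a minimal sketch of the scatter-gather job list it implies. It only builds command lines and does not run anything. It assumes GATK4-style tool names and arguments (HaplotypeCaller with -ERC GVCF and -L, CombineGVCFs, GenotypeGVCFs, GatherVcfs), and all file names (ref.fasta, S1.bam, part_001.bed, cohort.vcf.gz) are hypothetical placeholders. It is not meant as a recommended best practice, only as an illustration of the bookkeeping involved.

```python
from typing import Dict, List

Command = List[str]


def plan_scatter_gather(samples: List[str], n_parts: int,
                        ref: str = "ref.fasta") -> Dict[str, List[Command]]:
    """Build (but do not run) the command lines for the split-by-BED plan.

    Assumptions: GATK4-style arguments; BED parts are named part_001.bed,
    part_002.bed, ... and are numbered in genomic order.
    """
    call: List[Command] = []      # one per (sample, BED part)
    combine: List[Command] = []   # one per BED part
    genotype: List[Command] = []  # one per BED part
    part_vcfs: List[str] = []

    for p in range(1, n_parts + 1):
        bed = f"part_{p:03d}.bed"
        gvcfs = []
        for s in samples:
            gvcf = f"{s}.part_{p:03d}.g.vcf.gz"
            gvcfs.append(gvcf)
            # Per-sample gVCF calling restricted to one BED part via -L.
            call.append(["gatk", "HaplotypeCaller", "-R", ref,
                         "-I", f"{s}.bam", "-L", bed,
                         "-ERC", "GVCF", "-O", gvcf])

        # Merge all samples' gVCFs for this BED part.
        combined = f"cohort.part_{p:03d}.g.vcf.gz"
        cmd = ["gatk", "CombineGVCFs", "-R", ref, "-L", bed]
        for g in gvcfs:
            cmd += ["-V", g]
        combine.append(cmd + ["-O", combined])

        # Joint genotyping for this BED part.
        vcf = f"cohort.part_{p:03d}.vcf.gz"
        part_vcfs.append(vcf)
        genotype.append(["gatk", "GenotypeGVCFs", "-R", ref,
                         "-V", combined, "-L", bed, "-O", vcf])

    # Concatenate the per-part VCFs; inputs are listed in part order.
    gather: Command = ["gatk", "GatherVcfs"]
    for v in part_vcfs:
        gather += ["-I", v]
    gather += ["-O", "cohort.vcf.gz"]

    return {"call": call, "combine": combine,
            "genotype": genotype, "gather": [gather]}


if __name__ == "__main__":
    plan = plan_scatter_gather([f"S{i}" for i in range(1, 401)], 100)
    for step, cmds in plan.items():
        print(step, len(cmds))
```

With 400 samples and 100 parts, the "call" step alone holds one job per sample per BED part, which is where the 40,000 gVCFs in the question come from.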