HaplotypeCaller strategy on a large cohort of Whole Genome samples
We need to run HaplotypeCaller variant calling on a cohort of whole-genome samples (~400 samples). Since the region to be called is huge, I was wondering about the best way to do the sample-level gVCF calling and the steps that follow:
- Should we split the whole-genome region into several smaller parts (for example, 100 BED parts) and then perform gVCF calling on each of those 100 parts for each of the 400 samples (100 BED parts × 400 samples = 40,000 gVCFs)?
- ...then merge each BED part's gVCFs from all samples into one joint VCF for that BED part,
- ...and finally concatenate the 100 joint VCF parts into one whole-genome VCF file?
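Concretely, I imagine the scatter-gather would look roughly like the sketch below. This is only how I picture it, not a confirmed recipe: all file names, sample names, and interval files are placeholders, and the commands are printed with `echo` as a dry run rather than actually invoked.

```shell
# Dry-run sketch of the proposed scatter-gather scheme (placeholders throughout).
samples="sample1 sample2"            # ...up to ~400 samples in practice
intervals="part001.bed part002.bed"  # ...up to 100 BED parts in practice

# Step 1: one gVCF per (sample, interval) pair -> 400 x 100 jobs
for bed in $intervals; do
  for s in $samples; do
    echo gatk HaplotypeCaller -R ref.fasta -I "${s}.bam" \
         -L "$bed" -ERC GVCF -O "${s}.${bed%.bed}.g.vcf.gz"
  done
done

# Step 2: per interval, combine the per-sample gVCFs and joint-genotype them
for bed in $intervals; do
  part=${bed%.bed}
  combine="gatk CombineGVCFs -R ref.fasta"
  for s in $samples; do
    combine="$combine -V ${s}.${part}.g.vcf.gz"
  done
  echo "$combine -O cohort.${part}.g.vcf.gz"
  echo gatk GenotypeGVCFs -R ref.fasta \
       -V "cohort.${part}.g.vcf.gz" -O "cohort.${part}.vcf.gz"
done

# Step 3: concatenate the per-interval joint VCFs in genomic order
gather="gatk GatherVcfs"
for bed in $intervals; do
  gather="$gather -I cohort.${bed%.bed}.vcf.gz"
done
echo "$gather -O cohort.wholegenome.vcf.gz"
```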
Or is there a more efficient way to go about this?
I have not been able to find much information on your forums about people doing gVCF calling on smaller BED regions and then stitching those regions' gVCFs together into one large gVCF or VCF. I am aware that HaplotypeCaller has a -L option, but I am not sure what the recommended best practices are for using it with gVCF calling.
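For reference, my understanding is that -L accepts either an intervals file (such as a BED file) or an interval string; the dry-run lines below show both forms as I understand them (file and region names are placeholders, echoed rather than executed).

```shell
# Two ways I understand -L can be supplied (placeholders; echo = dry run).
shard_by_file=$(echo gatk HaplotypeCaller -R ref.fasta -I sample1.bam \
    -L part001.bed -ERC GVCF -O sample1.part001.g.vcf.gz)
shard_by_region=$(echo gatk HaplotypeCaller -R ref.fasta -I sample1.bam \
    -L chr20:1-5000000 -ERC GVCF -O sample1.chr20shard.g.vcf.gz)
echo "$shard_by_file"
echo "$shard_by_region"
```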