HaplotypeCaller strategy on a large cohort of Whole Genome samples

We need to run HaplotypeCaller variant calling on a cohort of about 400 whole-genome samples. Since the region to be called is huge, I was wondering what the best way would be to handle the per-sample gVCF calling and the subsequent GenotypeGVCFs step.

  1. Should we split the whole genome into several smaller parts (for example, 100 BED files) and then perform gVCF calling on each of those 100 parts for each of the 400 samples (100 BED parts × 400 samples = 40,000 gVCFs)?
  2. Then merge the gVCFs for each BED part across all samples into one joint VCF for that part (100 VCFs)?
  3. Then concatenate the 100 joint VCFs into one final whole-genome VCF file?

Or is there a more efficient way to go about this?

I have not been able to find much information on the forums about people running gVCF calling on smaller BED regions and then stitching those per-region gVCFs together into one large gVCF or VCF. I am aware that HaplotypeCaller has a -L option, but I am not sure what the recommended best practices are for using it in gVCF calling.
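
To make the scale of the scatter step concrete, here is a minimal sketch of the job plan from step 1, assuming GATK4-style command-line syntax (`gatk HaplotypeCaller` with `-R`, `-I`, `-L`, `-ERC GVCF`, `-O`). The sample names, the `bams/`, `parts/` and `gvcfs/` paths, and the helper names `haplotypecaller_cmd` and `scatter_plan` are all hypothetical, and the sketch only builds the commands without running them. It is not a recommendation of this strategy, just an illustration of what it would involve.

```python
from itertools import product


def haplotypecaller_cmd(sample, part, reference="reference.fasta"):
    """Build one GVCF-mode HaplotypeCaller command limited to one BED part.

    All file names and directory layouts here are hypothetical placeholders.
    """
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,                      # reference genome (assumed name)
        "-I", f"bams/{sample}.bam",           # per-sample input BAM (assumed layout)
        "-L", f"parts/part_{part:03d}.bed",   # restrict calling to one BED part
        "-ERC", "GVCF",                       # emit a gVCF rather than a plain VCF
        "-O", f"gvcfs/{sample}.part_{part:03d}.g.vcf.gz",
    ]


def scatter_plan(samples, n_parts):
    """Return one command per (sample, BED part) pair."""
    parts = range(1, n_parts + 1)
    return [haplotypecaller_cmd(s, p) for s, p in product(samples, parts)]


# 400 hypothetical samples x 100 BED parts, as in the question above.
samples = [f"sample_{i:03d}" for i in range(1, 401)]
plan = scatter_plan(samples, 100)

print(len(plan))          # 40000 per-sample, per-part gVCF jobs
print(" ".join(plan[0]))  # first command in the plan
```

Each entry in `plan` is an argument list that could be handed to a job scheduler or to `subprocess.run`; the merge and concatenation steps (2 and 3) would then operate on the 100 groups of 400 gVCFs that share a BED part.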

Shalabh Suman

Best Answer


  • shalabhsuman (NIH, Member)

    Anything on this??
