HaplotypeCaller on mixed SureSelect and HaloPlex sets
I am preparing to run the Best Practices variant calling pipeline on a large set of samples. These samples were captured with several technologies: some with SureSelect, some with Nextera, and some with HaloPlex. When running HaplotypeCaller in GVCF mode, do you have any recommendations for how to handle the overlapping and non-overlapping regions between these capture technologies?
Option A is to set the range to only those regions covered by all technologies. I see this as the safest, most conservative choice. However, it abandons potentially useful data by limiting the analysis to the smallest shared set of capture ranges.
Option B is to set the range to the largest capture ranges, that is, to analyze across the union of everything captured by any technology. My question here is whether the statistics used by HC, and later by VQSR, will be affected by regions whose coverage differs across sets of samples. For example, take Samples 1, 2, and 3, captured by SureSelect, Nextera, and HaloPlex, respectively, and regions A, B, and C, where A is unique to SureSelect, B is covered by all platforms, and C is unique to HaloPlex. Will it be a problem that Sample 1 has coverage in A and B, Sample 2 has coverage only in B, and Sample 3 has coverage in B and C? Will the statistics be computed properly for variants falling in A, B, and C?
Option C is the most annoying. I calculate the Venn diagram of the capture ranges and turn the crank on the best practices once for each Venn cell, then combine the VCFs after VQSR filtering. Basically, with SS, N, HP ranges I run the best practices for the overlapping ranges in:
1. SS+N+HP (same as option A ranges)
One of my concerns with this method is that for the singleton ranges, the number of samples multiplied by the size of the capture range may get too small for VQSR to be effective.
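To make the Venn cells in Option C concrete, here is a minimal sketch of how they could be computed from the capture ranges. It is only an illustration under simplifying assumptions: the function name `venn_cells` and the toy coordinates are hypothetical, intervals are half-open (start, end) pairs on a single contig, and real target files would need to be handled per contig.

```python
from collections import defaultdict


def venn_cells(targets):
    """Split capture ranges into Venn cells.

    targets: dict mapping kit name -> list of half-open (start, end)
             intervals on one contig (hypothetical input layout).
    Returns a dict mapping frozenset of kit names -> list of merged
    intervals covered by exactly that combination of kits.
    """
    # Every interval boundary is a position where the set of covering
    # kits can change, so consecutive boundaries delimit segments with
    # a constant set of covering kits.
    points = sorted({p for ivs in targets.values() for iv in ivs for p in iv})
    cells = defaultdict(list)
    for left, right in zip(points, points[1:]):
        covering = frozenset(
            kit
            for kit, ivs in targets.items()
            if any(s <= left and right <= e for s, e in ivs)
        )
        if not covering:
            continue  # gap captured by no kit
        segs = cells[covering]
        if segs and segs[-1][1] == left:
            # Extend the previous segment when the two are adjacent.
            segs[-1] = (segs[-1][0], right)
        else:
            segs.append((left, right))
    return dict(cells)


# Toy coordinates mirroring regions A, B, and C from Option B.
targets = {
    "SS": [(100, 300)],  # SureSelect: A and B
    "N": [(200, 300)],   # Nextera: B only
    "HP": [(200, 400)],  # HaloPlex: B and C
}
cells = venn_cells(targets)
for kits, intervals in cells.items():
    print("+".join(sorted(kits)), intervals)
```

With these toy coordinates, the SS-only cell is 100-200 (region A), the cell shared by all three kits is 200-300 (region B, the Option A range), and the HP-only cell is 300-400 (region C). The union of all cells corresponds to the Option B range.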
What are your thoughts on how to handle this?