HaplotypeCaller on mixed SureSelect and HaloPlex sets

I am preparing to run the Best Practices variant calling pipeline on a large set of samples. These samples were captured with several technologies: some with SureSelect, some with Nextera, and some with HaloPlex. When running HaplotypeCaller on the GVCFs, do you have any recommendations for handling the overlapping and non-overlapping regions between these capture technologies?

Option A is to restrict the intervals to only those regions covered by all three technologies. I see this as the safest and most conservative option; however, it abandons potential data by limiting analysis to the smallest shared set of capture ranges.

Option B is to set the intervals to the union of all capture ranges; basically, analyze across the largest range captured by any technology. My question here is whether the statistics used by HC, and later by VQSR, will be affected by regions whose coverage differs across sets of samples. For example, take Samples 1, 2, and 3, captured by SureSelect, Nextera, and HaloPlex, respectively, and regions A, B, and C, where A is SureSelect-unique, B is shared across all platforms, and C is HaloPlex-unique. Will it be a problem that Sample 1 has coverage in A and B, Sample 2 only in B, and Sample 3 in B and C? Will the statistics be computed properly for variants falling in A, B, and C?

Option C is the most annoying: I compute the Venn diagram of the capture ranges, turn the crank on the Best Practices once for each Venn cell, and combine the VCFs after VQSR filtering. Basically, with the SS, N, and HP ranges I run the Best Practices for the overlapping ranges in:
1. SS+N+HP (same as option A ranges)
2. SS+N
3. SS+HP
4. N+HP
5. SS
6. N
7. HP
One of my concerns with this method is that for the singleton ranges, the number of samples times the capture-range size may become too small for VQSR to be effective.
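To make the Venn-cell idea concrete, here is a minimal Python sketch (not GATK code; the interval coordinates and design labels are invented for illustration) that classifies a genomic position into the Venn cell of the capture designs covering it:

```python
# Toy Venn-cell classification for three capture designs.
# Real target intervals would come from the capture BED files.

def covers(intervals, pos):
    """True if pos falls inside any (start, end) half-open interval."""
    return any(start <= pos < end for start, end in intervals)

def venn_cell(pos, designs):
    """Return the set of design labels whose intervals cover pos."""
    return frozenset(label for label, ivs in designs.items()
                     if covers(ivs, pos))

# Invented target intervals on a single contig, half-open (start, end).
designs = {
    "SS": [(100, 300)],
    "N":  [(200, 400)],
    "HP": [(250, 500)],
}

venn_cell(150, designs)  # -> {"SS"}: SS-unique cell
venn_cell(260, designs)  # -> {"SS", "N", "HP"}: covered by all three
venn_cell(450, designs)  # -> {"HP"}: HP-unique cell
```

In practice the cells would be computed once from the capture BEDs (e.g. with bedtools); this sketch only illustrates the classification logic behind the seven-way split above.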

What are your thoughts on how to handle this?

Thanks
Alex

Answers

  • AlexanderHolman (DFCI) Member

    Or...
    Option D could be to run the pipeline three times, once for the samples in each capture set. There should be sufficient numbers of samples in each of the SS, N, and HP sets to run VQSR. After filtration, I'd combine the VCFs. My concern here is that this brings up all the complexities of combining variants that may occur at identical locations.
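A minimal Python sketch of that merge headache (toy records; the positions, alleles, and set labels are invented), flagging sites where independently produced per-capture-set VCFs disagree:

```python
# Naively concatenating three independently filtered VCFs can leave
# duplicate or contradictory records at the same site. This toy example
# groups invented records by site and flags allele conflicts.
from collections import defaultdict

# (capture_set, (chrom, pos), (REF, ALT)) from three separate VCFs.
records = [
    ("SS", ("chr1", 1000), ("A", "G")),
    ("N",  ("chr1", 1000), ("A", "T")),   # same site, different ALT
    ("HP", ("chr1", 2000), ("C", "CT")),
]

by_site = defaultdict(list)
for capture_set, site, alleles in records:
    by_site[site].append((capture_set, alleles))

# Sites where the per-set VCFs report different alleles.
conflicts = {site: calls for site, calls in by_site.items()
             if len({alleles for _, alleles in calls}) > 1}
# conflicts -> {("chr1", 1000): [("SS", ("A", "G")), ("N", ("A", "T"))]}
```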

  • Sheila (Broad Institute) Member, Broadie, Moderator, admin

    @AlexanderHolman
    Hi Alex,

    Have a look at the GVCF workflow: http://gatkforums.broadinstitute.org/discussion/3893/calling-variants-on-cohorts-of-samples-using-the-haplotypecaller-in-gvcf-mode The GVCF workflow is designed to deal with cases like yours. You will run HaplotypeCaller on each of your samples separately in GVCF mode, so all the regions covered by each sample will be accounted for. Then you will run GenotypeGVCFs on all the individual GVCFs together. If a sample has no coverage at a site that is variant in other samples, its genotype will be reported as ./. (no-call), meaning there was no data at that site.
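As an illustration of that ./. behavior, here is a minimal Python sketch (not GATK code; the sample names and genotype fields are invented) of how a no-call genotype in a joint VCF record marks a sample with no data:

```python
# A "./." GT value in a joint-genotyped VCF means no data for that
# sample at that site. GT is the first FORMAT subfield.

def genotype_called(gt_field):
    """True if the GT value represents a call; './.' means no data."""
    gt = gt_field.split(":")[0]
    return not all(a == "." for a in gt.replace("|", "/").split("/"))

# One invented joint record: Sample2 lacks coverage at this site.
samples = ["Sample1", "Sample2", "Sample3"]
fields  = ["0/1:30", "./.:0", "1/1:25"]

called = {s: genotype_called(g) for s, g in zip(samples, fields)}
# called -> {"Sample1": True, "Sample2": False, "Sample3": True}
```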

    How many samples do you have?

    -Sheila

  • AlexanderHolman (DFCI) Member

    I have already implemented the GVCF workflow. Am I correct, then, that when using GVCFs, any depth bias from differing capture technologies and range bias from differing capture windows will be accounted for in both base recalibration and GVCF calling? As I read your answer, proposed option B above, using the overall largest capture ranges, would be ideal.

    My dataset is roughly 500 samples.

    Thanks,
    Alex

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie, admin

    Not all bias will be eliminated, but the pros of joint calling outweigh the cons of mixing different capture technologies, as long as it's all exome capture. If you were to mix exome and whole-genome data, you would get into more trouble at the VQSR stage, because the annotation profiles differ too much.
