HaplotypeCaller sensitivity in large(ish) cohorts
One of my projects currently has ~150 patients (exomes) that I've been processing through the standard pipeline (GATK 2.8-1, including ReduceReads). In my most recent run through HC, I split the cohort in half for the sake of time. A subset of these patients have undergone targeted genotyping in the clinic, and I have a list of 36 validated variants in 28 samples. When I checked these variants in the final VCF, 5 of the 36 were not called by HaplotypeCaller even though they have moderate to excellent support in the BAMs. Several of these (possibly all of them, I'm not sure) were present in previous HC and UG runs with fewer samples, and I verified that the one I'm focusing on is called correctly when I use only five samples.
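A check of this kind (validated variants against the genotypes in a multi-sample VCF) can be sketched in Python as follows. This is only a minimal illustration, not the script used for the numbers above: the function names and the tuple layout `(chrom, pos, ref, alt, sample)` are assumptions made for the example, and it matches alleles by exact string comparison, so an indel written with a different representation in the VCF would be reported as missing.

```python
def parse_vcf_calls(vcf_lines):
    """Collect (chrom, pos, ref, alt, sample) for every non-reference allele
    found in a sample's GT field. Hypothetical helper for illustration."""
    called = set()
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        alts = fields[4].split(",")
        gt_index = fields[8].split(":").index("GT")
        for sample, entry in zip(samples, fields[9:]):
            genotype = entry.split(":")[gt_index]
            for allele in genotype.replace("|", "/").split("/"):
                # "." is a no-call and "0" is the reference allele
                if allele.isdigit() and int(allele) > 0:
                    called.add((chrom, pos, ref, alts[int(allele) - 1], sample))
    return called


def missing_variants(validated, vcf_lines):
    """Return the validated (chrom, pos, ref, alt, sample) tuples that have
    no matching non-reference genotype in the VCF."""
    called = parse_vcf_calls(vcf_lines)
    return [variant for variant in validated if variant not in called]
```

Passing an open VCF file handle as `vcf_lines` and the validated list as tuples returns the entries that still need to be looked at by hand in the BAMs.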
Debugging runs on a small region have revealed the following:
- ReduceReads does not seem to be the culprit: my variant is still uncalled when using the un-reduced BAMs
- My variant is not inside an Active Region
- When I force the site to be active with -forceActive, the variant is still not in the trimmed ActiveRegion
- I've tried increasing -maxNumHaplotypesInPopulation as high as 1024, and the trimmed region still doesn't include my variant
- I've also tried running with -dontTrimActiveRegions, but haven't successfully finished a run yet (runtime increases from 30 seconds to over an hour; I keep submitting it to short queues while I'm doing other things, and the jobs get killed by the scheduler)
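The "is my variant inside an active region?" check from the list above can be sketched as a simple interval lookup. This is a minimal illustration that assumes the active (or trimmed) regions are already available as `(chrom, start, end)` tuples with 1-based inclusive coordinates; how those intervals are exported from HaplotypeCaller is not covered here, and both function names are hypothetical.

```python
def regions_containing(chrom, pos, regions):
    """Return the (chrom, start, end) regions that contain the position.
    Coordinates are assumed to be 1-based and inclusive."""
    return [r for r in regions if r[0] == chrom and r[1] <= pos <= r[2]]


def nearest_region_distance(chrom, pos, regions):
    """Distance in bp from pos to the closest region on the same chromosome.
    Returns 0 if pos lies inside a region and None if the chromosome has no
    regions at all."""
    distances = []
    for region_chrom, start, end in regions:
        if region_chrom != chrom:
            continue
        if start <= pos <= end:
            return 0
        distances.append(start - pos if pos < start else pos - end)
    return min(distances) if distances else None
```

The distance is useful when comparing runs: a variant that sits a few bases outside the trimmed region points at trimming, while one that is far from any region points at the activity profile itself.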
A couple of other random notes that may or may not be applicable: These are rare variants that I only expect to see in 1 or 2 samples. My testing region is ~400bp around the variant in question. There is a variant in another sample at an immediately adjacent nucleotide that is also not called (and, perhaps obviously, is also outside the active regions).
Do you have any suggestions for approaching this? I haven't messed with -minPruning yet, as increasing that value should result in a loss of sensitivity and reducing it seems like a bad idea. I suppose I could split my cohort into subsets of 30 or 40 samples, but that doesn't seem like the best approach.
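If the cohort does end up being split, the batches can at least be kept roughly equal in size. The sketch below is a hypothetical helper (the name `split_cohort` and the size cap are assumptions for illustration) that divides a list of sample IDs into the smallest number of batches that respects a maximum batch size.

```python
import math


def split_cohort(samples, max_batch_size):
    """Split sample IDs into the fewest batches of at most max_batch_size,
    keeping batch sizes within one sample of each other."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    if not samples:
        return []
    n_batches = math.ceil(len(samples) / max_batch_size)
    base, extra = divmod(len(samples), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # the first `extra` batches take one additional sample
        size = base + (1 if i < extra else 0)
        batches.append(samples[start:start + size])
        start += size
    return batches
```

With 150 samples and a cap of 40, this gives four batches of 38, 38, 37, and 37 samples rather than three full batches and one small remainder.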