HaplotypeCaller sensitivity in large(ish) cohorts

pdexheimerpdexheimer Posts: 409Member, GSA Collaborator ✭✭✭✭

One of my projects currently has ~150 patients (exomes) that I've been processing through the standard pipeline (2.8-1, including ReduceReads). In my most recent run through HC, I split the cohort in half for the sake of time. A subset of these patients have undergone targeted genotyping in the clinic, and I have a list of 36 validated variants in 28 samples. When I checked these variants in the final VCF, 5 of 36 were not called by HaplotypeCaller and have moderate to excellent support in the BAM. Several of these (possibly all of them? Not sure) were present in previous HC and UG runs with fewer samples, and I verified that the one I'm focusing on is called correctly when I only use five samples.

Debugging runs on a small region have revealed the following:

  1. ReduceReads does not seem to be the culprit, my variant is still uncalled when using the un-reduced bams
  2. My variant is not inside an Active Region
  3. When I force it to be with -forceActive, it's not in the trimmed ActiveRegion
  4. I've tried increasing -maxNumHaplotypesInPopulation as high as 1024, and the trimmed region still doesn't include my variant
  5. I've also tried running with -dontTrimActiveRegions, but haven't successfully finished yet (runtime increases from 30 seconds to over an hour, I keep trying to run it in short queues while I'm doing other stuff and getting killed by the scheduler)

A couple of other random notes that may or may not be applicable: These are rare variants that I only expect to see in 1 or 2 samples. My testing region is ~400bp around the variant in question. There is a variant in another sample at an immediately adjacent nucleotide that is also not called (and, perhaps obviously, is also outside the active regions).

Do you have any suggestions for approaching this? I haven't messed with -minPruning yet, as increasing that value should result in a loss of sensitivity and reducing it seems like a bad idea. I suppose I could split my cohort into subsets of 30 or 40 samples, but that doesn't seem like the best approach

Tagged:

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,973Administrator, GATK Developer admin

    Hi Phil,

    You shouldn't have to deploy that much effort, so we think this is a case where HC is not doing the right thing. We'd like to help figure this out as it sounds like a good test case to improve sensitivity. Would you be able to share data with us so we can debug locally?

    Geraldine Van der Auwera, PhD

  • pdexheimerpdexheimer Posts: 409Member, GSA Collaborator ✭✭✭✭

    Probably, but I'll have to clear it with the investigator and put together a coherent test case. I'll be in touch...

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,973Administrator, GATK Developer admin

    Sure -- we can sign some NDA paperwork if necessary.

    Geraldine Van der Auwera, PhD

  • OprahOprah Posts: 26Member

    Maybe Phil doesn't need to share his data, because his problem is very similar to a couple of others from last week: "HaplotypeCaller sensitivity" and "HaplotypeCaller misses a true variant". Geraldine, any progress on those?

  • pdexheimerpdexheimer Posts: 409Member, GSA Collaborator ✭✭✭✭

    Okay, permission obtained and data uploaded (pdexheimer_HC.tgz). I included a README.txt describing the expected and actual results. I promised the PI I would notify you that this data is unpublished and confidential - so don't go writing a Nature paper based on 500bp from a couple dozen random people :)

    Thanks for taking a look, hope this helps track down the bug

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,973Administrator, GATK Developer admin

    Thanks @pdexheimer, that was fast! We'll treat this appropriately and delete the data when we're done. And it would be pretty worrying if Nature accepted a paper like that! Not that I'm naive enough to think it couldn't happen... sigh

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,973Administrator, GATK Developer admin

    @Oprah, we've found the probable cause of the HaplotypeCaller misses a true variant issue. The other issue is still unresolved. These sensitivity issues are typically due to different underlying causes even though the overall symptoms appear to be similar, so we prefer to treat them individually rather than assume they are the same.

    Geraldine Van der Auwera, PhD

  • OprahOprah Posts: 26Member

    Phil, the release notes for 3.0 mention a few bug fixes. Are you going to test the new HC to see if it fixes the issue you reported? Thanks.

  • pdexheimerpdexheimer Posts: 409Member, GSA Collaborator ✭✭✭✭

    I am. Hopefully I can start everything running tonight and have an answer at the beginning of next week

  • pdexheimerpdexheimer Posts: 409Member, GSA Collaborator ✭✭✭✭

    @Oprah, I'm seeing major improvements. I won't do a thorough evaluation until Monday, but the concordance with my validated set is much higher. It's not 100%, but I remember seeing at least one "validated" variant with zero bam support that I would never expect to see called

  • OprahOprah Posts: 26Member

    Thanks for the quick reply Phil. I also see major improvements using HC + GenotypeGVCFs (but not HC alone, without GenotypeGVCFs)

  • pdexheimerpdexheimer Posts: 409Member, GSA Collaborator ✭✭✭✭

    Yes, I definitely consider this matter solved. Of my 38 "confirmed" variants, the v3 pipeline called 35. On closer inspection, one has no read support and shouldn't be called, and two lie outside my analysis regions (I forgot to include the interval_padding argument. This is what happens when you do things in a hurry).

  • ebanksebanks Posts: 684GATK Developer mod

    Great! Thanks for letting us know.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Sign In or Register to comment.