Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

HaplotypeCaller sensitivity in large(ish) cohorts

pdexheimerpdexheimer Member ✭✭✭✭

One of my projects currently has ~150 patients (exomes) that I've been processing through the standard pipeline (2.8-1, including ReduceReads). In my most recent run through HC, I split the cohort in half for the sake of time. A subset of these patients have undergone targeted genotyping in the clinic, and I have a list of 36 validated variants in 28 samples. When I checked these variants in the final VCF, 5 of 36 were not called by HaplotypeCaller and have moderate to excellent support in the BAM. Several of these (possibly all of them? Not sure) were present in previous HC and UG runs with fewer samples, and I verified that the one I'm focusing on is called correctly when I only use five samples.

Debugging runs on a small region have revealed the following:

  1. ReduceReads does not seem to be the culprit, my variant is still uncalled when using the un-reduced bams
  2. My variant is not inside an Active Region
  3. When I force it to be with -forceActive, it's not in the trimmed ActiveRegion
  4. I've tried increasing -maxNumHaplotypesInPopulation as high as 1024, and the trimmed region still doesn't include my variant
  5. I've also tried running with -dontTrimActiveRegions, but haven't successfully finished yet (runtime increases from 30 seconds to over an hour, I keep trying to run it in short queues while I'm doing other stuff and getting killed by the scheduler)

A couple of other random notes that may or may not be applicable: These are rare variants that I only expect to see in 1 or 2 samples. My testing region is ~400bp around the variant in question. There is a variant in another sample at an immediately adjacent nucleotide that is also not called (and, perhaps obviously, is also outside the active regions).

Do you have any suggestions for approaching this? I haven't messed with -minPruning yet, as increasing that value should result in a loss of sensitivity and reducing it seems like a bad idea. I suppose I could split my cohort into subsets of 30 or 40 samples, but that doesn't seem like the best approach


Best Answer


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Phil,

    You shouldn't have to deploy that much effort, so we think this is a case where HC is not doing the right thing. We'd like to help figure this out as it sounds like a good test case to improve sensitivity. Would you be able to share data with us so we can debug locally?

  • pdexheimerpdexheimer Member ✭✭✭✭

    Probably, but I'll have to clear it with the investigator and put together a coherent test case. I'll be in touch...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Sure -- we can sign some NDA paperwork if necessary.

  • OprahOprah Member

    Maybe Phil doesn't need to share his data, because his problem is very similar to a couple of others from last week: "HaplotypeCaller sensitivity" and "HaplotypeCaller misses a true variant". Geraldine, any progress on those?

  • pdexheimerpdexheimer Member ✭✭✭✭

    Okay, permission obtained and data uploaded (pdexheimer_HC.tgz). I included a README.txt describing the expected and actual results. I promised the PI I would notify you that this data is unpublished and confidential - so don't go writing a Nature paper based on 500bp from a couple dozen random people :)

    Thanks for taking a look, hope this helps track down the bug

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks @pdexheimer, that was fast! We'll treat this appropriately and delete the data when we're done. And it would be pretty worrying if Nature accepted a paper like that! Not that I'm naive enough to think it couldn't happen... sigh

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @Oprah, we've found the probable cause of the HaplotypeCaller misses a true variant issue. The other issue is still unresolved. These sensitivity issues are typically due to different underlying causes even though the overall symptoms appear to be similar, so we prefer to treat them individually rather than assume they are the same.

  • OprahOprah Member

    Phil, the release notes for 3.0 mention a few bug fixes. Are you going to test the new HC to see if it fixes the issue you reported? Thanks.

  • pdexheimerpdexheimer Member ✭✭✭✭

    I am. Hopefully I can start everything running tonight and have an answer at the beginning of next week

  • pdexheimerpdexheimer Member ✭✭✭✭

    @Oprah, I'm seeing major improvements. I won't do a thorough evaluation until Monday, but the concordance with my validated set is much higher. It's not 100%, but I remember seeing at least one "validated" variant with zero bam support that I would never expect to see called

  • OprahOprah Member

    Thanks for the quick reply Phil. I also see major improvements using HC + GenotypeGVCFs (but not HC alone, without GenotypeGVCFs)

  • pdexheimerpdexheimer Member ✭✭✭✭

    Yes, I definitely consider this matter solved. Of my 38 "confirmed" variants, the v3 pipeline called 35. On closer inspection, one has no read support and shouldn't be called, and two lie outside my analysis regions (I forgot to include the interval_padding argument. This is what happens when you do things in a hurry).

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Great! Thanks for letting us know.

  • I think I'm experiencing the same issue. I have 6567 samples in the cohort. I focus on one of the expected variants. 7 of them are mutation positive cases. If I generate a multiple sample bam with the 7 cases, the variant is correctly called acorss all the samples. But if I use the bam for alll the 6567 cases, the variant is missing. I'm using gatk-nightly-2016-01-24

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin


    If I generate a multiple sample bam with the 7 cases, the variant is correctly called acorss all the samples. But if I use the bam for alll the 6567 cases, the variant is missing.

    I am not sure I understand. Can you clarify what exactly you are doing in each case? Perhaps posting the exact commands you ran would help.


Sign In or Register to comment.