The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# HaplotypeCaller sensitivity in large(ish) cohorts

Member, Dev Posts: 544 ✭✭✭✭

One of my projects currently has ~150 patients (exomes) that I've been processing through the standard pipeline (2.8-1, including ReduceReads). In my most recent run through HC, I split the cohort in half for the sake of time. A subset of these patients have undergone targeted genotyping in the clinic, and I have a list of 36 validated variants in 28 samples. When I checked these variants in the final VCF, 5 of 36 were not called by HaplotypeCaller and have moderate to excellent support in the BAM. Several of these (possibly all of them? Not sure) were present in previous HC and UG runs with fewer samples, and I verified that the one I'm focusing on is called correctly when I only use five samples.

Debugging runs on a small region have revealed the following:

1. ReduceReads does not seem to be the culprit, my variant is still uncalled when using the un-reduced bams
2. My variant is not inside an Active Region
3. When I force it to be with -forceActive, it's not in the trimmed ActiveRegion
4. I've tried increasing -maxNumHaplotypesInPopulation as high as 1024, and the trimmed region still doesn't include my variant
5. I've also tried running with -dontTrimActiveRegions, but haven't successfully finished yet (runtime increases from 30 seconds to over an hour, I keep trying to run it in short queues while I'm doing other stuff and getting killed by the scheduler)

A couple of other random notes that may or may not be applicable: These are rare variants that I only expect to see in 1 or 2 samples. My testing region is ~400bp around the variant in question. There is a variant in another sample at an immediately adjacent nucleotide that is also not called (and, perhaps obviously, is also outside the active regions).

Do you have any suggestions for approaching this? I haven't messed with -minPruning yet, as increasing that value should result in a loss of sensitivity and reducing it seems like a bad idea. I suppose I could split my cohort into subsets of 30 or 40 samples, but that doesn't seem like the best approach

Tagged:

Hi Phil,

You shouldn't have to deploy that much effort, so we think this is a case where HC is not doing the right thing. We'd like to help figure this out as it sounds like a good test case to improve sensitivity. Would you be able to share data with us so we can debug locally?

Geraldine Van der Auwera, PhD

• Member, Dev Posts: 544 ✭✭✭✭

Probably, but I'll have to clear it with the investigator and put together a coherent test case. I'll be in touch...

Sure -- we can sign some NDA paperwork if necessary.

Geraldine Van der Auwera, PhD

• Member Posts: 29

Maybe Phil doesn't need to share his data, because his problem is very similar to a couple of others from last week: "HaplotypeCaller sensitivity" and "HaplotypeCaller misses a true variant". Geraldine, any progress on those?

• Member, Dev Posts: 544 ✭✭✭✭

Okay, permission obtained and data uploaded (pdexheimer_HC.tgz). I included a README.txt describing the expected and actual results. I promised the PI I would notify you that this data is unpublished and confidential - so don't go writing a Nature paper based on 500bp from a couple dozen random people

Thanks for taking a look, hope this helps track down the bug

Thanks @pdexheimer, that was fast! We'll treat this appropriately and delete the data when we're done. And it would be pretty worrying if Nature accepted a paper like that! Not that I'm naive enough to think it couldn't happen... sigh

Geraldine Van der Auwera, PhD

@Oprah, we've found the probable cause of the HaplotypeCaller misses a true variant issue. The other issue is still unresolved. These sensitivity issues are typically due to different underlying causes even though the overall symptoms appear to be similar, so we prefer to treat them individually rather than assume they are the same.

Geraldine Van der Auwera, PhD

• Member Posts: 29

Phil, the release notes for 3.0 mention a few bug fixes. Are you going to test the new HC to see if it fixes the issue you reported? Thanks.

• Member, Dev Posts: 544 ✭✭✭✭

I am. Hopefully I can start everything running tonight and have an answer at the beginning of next week

• Member, Dev Posts: 544 ✭✭✭✭

@Oprah, I'm seeing major improvements. I won't do a thorough evaluation until Monday, but the concordance with my validated set is much higher. It's not 100%, but I remember seeing at least one "validated" variant with zero bam support that I would never expect to see called

• Member Posts: 29

Thanks for the quick reply Phil. I also see major improvements using HC + GenotypeGVCFs (but not HC alone, without GenotypeGVCFs)

• Member, Dev Posts: 544 ✭✭✭✭

Yes, I definitely consider this matter solved. Of my 38 "confirmed" variants, the v3 pipeline called 35. On closer inspection, one has no read support and shouldn't be called, and two lie outside my analysis regions (I forgot to include the interval_padding argument. This is what happens when you do things in a hurry).