The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# HaplotypeCaller '--emitRefConfidence' gives same PL(0) to two genotypes

Member Posts: 46
edited March 2014

Hi,

In our downstream analysis we have to differentiate between non-variants because of no coverage and confident reference calls. Because of this, we need to pay close attention to HaplotypeCaller non-variant calls. Doing this we found calls like this one:

17 7127955 . G <NON_REF> . . END=7127955 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,468

Would it be possible to get a feeling why this initial call is giving the same PL(0) to 0/0 and 0/1. If you look at the alignment,after "Data Pre-processing" best practices, we can see there is strong evidence this is a 0/0 position.

GenotypeGVCFs won't call this position as variant, which is good.

Thanks,
Carlos

PS: I tried to add a BAM file with the relevant region and I got "Uploaded file type is not allowed".

Tagged:

Hi Carlos,

Can you post a screenshot of the site in IGV, and a few more comments about why you believe the call is incorrect?

To upload data, you can use our FTP server (see FAQs for address and login info) -- but please don't upload anything until we ask for it specifically. If the devs decide they need to look at the BAM file to give their opinion, I will let you know. Thanks for understanding!

Geraldine Van der Auwera, PhD

• Member Posts: 46

Thanks for looking into this. This is the view in IGV.

Best,

Carlos

OK, that's pretty clean. Can you tell me which version this is from? I assume 3.0 or 3.1?

Geraldine Van der Auwera, PhD

• Member Posts: 46

This is 3.1.

BTW, we do not think the call is wrong. We think this is a 0/0 as GATK calls it even before GenotypeGVCFs. We are curious of why the PL looks like '0,0,468'. Why HaplotypeCaller is as confident this is a 0/0 as 0/1?.

--Carlos

Yes, I understand, and I agree those PLs are surprising. Have you looked at the annotations on the haplotypes that the HC outputs using -bamOut for this region?

Geraldine Van der Auwera, PhD

Hi Carlos, could you please upload a snippet of the bam file with the problem region to our FTP server, so we can look at this more closely? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

Geraldine Van der Auwera, PhD

• Member Posts: 46

I'm very grateful you are taking the time to look into this. I have uploaded the relevant information to the ftp server. Please look for file '12702.tar.bz2'. I think I included all the information you need to reproduce the issue, but please let me know if you need anything else.

I haven't had the time to play with '-bamOut'. It is something I definitely want to try.

--Carlos

Hi Carlos,

Thanks for the data, this all looks good. I've put it in the bug tracker; the devs are pretty busy right now but they should be able to look at your bug report within a few days. I'll let you know in this thread when we have a better idea of what's going on.

Geraldine Van der Auwera, PhD

• MGHMember Posts: 16

@Geraldine, one cause of this issue seems to be that if the GenotypeLikelihood is higher for the Heterozygote call, the Haplotype caller by default lowers the value to be equal to the homozygous reference likelihood, and so the PL values can wind up being equal for the homozygous reference and the heterozygote, leading to the puzzling 0,0,XXX values for the PL field. The bit of code that seems responsible for this is inside the RefVsAnyResult.java file, specifically the capByHomRefLikelihood method. Do you know what the reason for capping the values is? It seems like if the likelihood model is correct and the heterozygous state is the most probable genotype, there wouldn't be a need to lower the value to match the HomRefLikelihood.

• Cambridge, MAMember, Dev Posts: 59 ✭✭

@evolvedmicrobe, well spotted. This is a current limitation of the HaplotypeCaller GVCF output.

At the point of the process where that adjustment is done, the HC has already committed to the output of a non-variant site after the local-assembly process has not found enough evidence for any particular alternative allele there. Thus to be consistent with that non-variant call the Hom-Ref genotype is constrained to be at least as like as any other genotype thus the resulting 0,0,XXX likelihoods.

In most cases this should be due to the presence of surrounding high quality soft-clipped bases that cast doubt on a Hom-Ref call. However this is something we should revisit in future releases since high quality soft-clips might be derived from medium-large insertions that should not affect the confidence on contiguous non-variant sites. Unfortunately I cannot give you a time-frame for that though.

• Member Posts: 46

Thanks so much for looking in to this issue. I want to share another example which makes us a little more worry.

10 72357713 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:21,8:29:0:0,0,334

In this case there is evidence this could be a Het. I looked into using '-bamout' to see how is HC assembling this regions and I attach an screenshot. Top original alignment, bottom '-bamout'.

• MGHMember Posts: 16

@valentin, thanks for the response! I hope you won’t mind if I ask a quick follow up question, as I didn’t quite get your explanation. For the soft clipped bases that you mention, are these feeding into the Pileup used for the PL calculation, or in to the HC caller? (I gather the presence of soft clipped reads in one but not the other explained the discrepancy in the likelihoods for the heterozygote call, but wasn’t sure which one they showed up in). Thanks again!

• Cambridge, MAMember, Dev Posts: 59 ✭✭

@‌CarlosBorroto thanks for that additional example.

The same applies here… in this case HC does not find evidence of variation on that region… this could be due to several reasons and we might need to do a more in-depth analysis of additional debug outputs to understand the exact reason.

Then, and only when reference model confidence output is requested (GVCF or BP_RESOLUTION), HC goes into generating those non-variant sites position by position using the original alignment. At this point HC is committed to output a non-variant site, thus it won't ever do a het or hom-alt call. At the same time there is some evidence that suggest that the hom-ref is not a reliable call either thus the 0,0,XXX likelihood and GQ == 0.

In practice the only difference between this case and the previous one is the nature of the evidence casting doubt on the hom-ref call. In the former it seems pretty much due to surrounding soft-clips, in the latter there are mismatching base calls.

• Cambridge, MAMember, Dev Posts: 59 ✭✭
edited April 2014

@evolvedmicrobe actually the soft-clips are fed into both processes (1) the variant discovery phase using local de-novo assembly and (2) non-variant sites ref-confidence output generation using the original pileup (with high quality soft-clips unclipped).

In that case (1) does not find evidence of a valid concrete variant on that region and so it does not add any corresponding variant site call to the output it need to fill the 'ALT' column with something concrete. However (2) is not constraint to make up its mind about a particular alternative allele, it just needs to know that there might be something out there, whatever that is (i.e. <NON_REF>), and that is what happens in this case.

Even if these two aspects may seem contradictory, actually they are both compatible.

I hope this makes it more clear.

Post edited by valentin on
• Member Posts: 33
edited April 2015

Hey GATK Team,

FYI: I was struggling with this same issue, but found the fix (for my specific case?). I'm posting this fix for posterity.

I am using GATK v3.3-0 with --emitRefConfidence BP_RESOLUTION and I had good evidence that a particular variant existed and was real. My analysis focused on an individual hybrid of species A and B. From species B I had an individual that was homozygous for the (species B-differentiating) alternate allele and the raw aligned reads for the hybrid suggested it was heterozygous, as expected.

When attempting to call the specific variant locus of interest, the HaplotypeCaller would call neighboring variants within about 50 bp of it, but would not call my target variant correctly. Looking at the --bamOutput from HC I could see that the HC had thrown all the variant reads out at that site and had only assembled the reference-like haplotypes at my target locus. The variant-allele bearing reads were not duplicates, secondary, or supplemental alignments, and had higher MapQ and BaseQ than my min thresholds. They variant alleles were also linked on the same reads/haplotype as successfully called neighboring het variants.

After some fiddling with parameters over a test target region, it looks like the narrow range of --kmerSize (default k = 10 and 25) is responsible. Increasing the --kmerSize` (or range of permitted sizes: 10, 25, 35 for example) seems to do the trick and the variant is correctly called. This would imply that the local context was non-unique at these k-mer sizes and the variant was pruned out of the read-threading assembly graph, and were therefore never assembled into haplotypes for the variant to be called.

#### Issue · Github April 2015 by Geraldine_VdAuwera

Issue Number
13
State
open
Last Updated
Assignee
Array

Thanks for reporting this, @bredeson. I'd like to put together a collection of helpful reports like this to illustrate the kind of advanced usage tweaks that people find useful to improve sensitivity in specific cases. I'm not sure yet when it will happen or what it'll look like exactly, but my sense is that there's potential for something like that to be a helpful resource if it's collated as opposed to being spread across many isolated discussions. Would you mind if I flag this for inclusion when we get around to that?

Geraldine Van der Auwera, PhD

• Member Posts: 33

I have no issue with you including this :-)