GATK licensing moves to direct-through-Broad model -- read about it on the GATK blog

HaplotypeCaller '--emitRefConfidence' gives same PL(0) to two genotypes

CarlosBorrotoCarlosBorroto Posts: 38Member
edited March 2014 in Ask the GATK team

Hi,

In our downstream analysis we have to differentiate between non-variants because of no coverage and confident reference calls. Because of this, we need to pay close attention to HaplotypeCaller non-variant calls. Doing this we found calls like this one:

17 7127955 . G <NON_REF> . . END=7127955 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,468

Would it be possible to get a feeling why this initial call is giving the same PL(0) to 0/0 and 0/1. If you look at the alignment,after "Data Pre-processing" best practices, we can see there is strong evidence this is a 0/0 position.

GenotypeGVCFs won't call this position as variant, which is good.

Thanks,
Carlos

PS: I tried to add a BAM file with the relevant region and I got "Uploaded file type is not allowed".

Post edited by CarlosBorroto on
Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,759Administrator, GATK Dev admin

    Hi Carlos,

    Can you post a screenshot of the site in IGV, and a few more comments about why you believe the call is incorrect?

    To upload data, you can use our FTP server (see FAQs for address and login info) -- but please don't upload anything until we ask for it specifically. If the devs decide they need to look at the BAM file to give their opinion, I will let you know. Thanks for understanding!

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 38Member

    Hi @Geraldine_VdAuwera‌ ,

    Thanks for looking into this. This is the view in IGV.

    Best,

    Carlos

    gatk_3.1_emitrefconfidence_PLs.png
    2034 x 1288 - 145K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,759Administrator, GATK Dev admin

    OK, that's pretty clean. Can you tell me which version this is from? I assume 3.0 or 3.1?

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 38Member

    This is 3.1.

    BTW, we do not think the call is wrong. We think this is a 0/0 as GATK calls it even before GenotypeGVCFs. We are curious of why the PL looks like '0,0,468'. Why HaplotypeCaller is as confident this is a 0/0 as 0/1?.

    --Carlos

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,759Administrator, GATK Dev admin

    Yes, I understand, and I agree those PLs are surprising. Have you looked at the annotations on the haplotypes that the HC outputs using -bamOut for this region?

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,759Administrator, GATK Dev admin

    Hi Carlos, could you please upload a snippet of the bam file with the problem region to our FTP server, so we can look at this more closely? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 38Member

    Hi @Geraldine_VdAuwera‌ ,

    I'm very grateful you are taking the time to look into this. I have uploaded the relevant information to the ftp server. Please look for file '12702.tar.bz2'. I think I included all the information you need to reproduce the issue, but please let me know if you need anything else.

    I haven't had the time to play with '-bamOut'. It is something I definitely want to try.

    --Carlos

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,759Administrator, GATK Dev admin

    Hi Carlos,

    Thanks for the data, this all looks good. I've put it in the bug tracker; the devs are pretty busy right now but they should be able to look at your bug report within a few days. I'll let you know in this thread when we have a better idea of what's going on.

    Geraldine Van der Auwera, PhD

  • evolvedmicrobeevolvedmicrobe MGHPosts: 14Member

    @Geraldine, one cause of this issue seems to be that if the GenotypeLikelihood is higher for the Heterozygote call, the Haplotype caller by default lowers the value to be equal to the homozygous reference likelihood, and so the PL values can wind up being equal for the homozygous reference and the heterozygote, leading to the puzzling 0,0,XXX values for the PL field. The bit of code that seems responsible for this is inside the RefVsAnyResult.java file, specifically the capByHomRefLikelihood method. Do you know what the reason for capping the values is? It seems like if the likelihood model is correct and the heterozygous state is the most probable genotype, there wouldn't be a need to lower the value to match the HomRefLikelihood.

  • valentinvalentin Cambridge, MAPosts: 19Member, GATK Dev mod

    @evolvedmicrobe, well spotted. This is a current limitation of the HaplotypeCaller GVCF output.

    At the point of the process where that adjustment is done, the HC has already committed to the output of a non-variant site after the local-assembly process has not found enough evidence for any particular alternative allele there. Thus to be consistent with that non-variant call the Hom-Ref genotype is constrained to be at least as like as any other genotype thus the resulting 0,0,XXX likelihoods.

    In most cases this should be due to the presence of surrounding high quality soft-clipped bases that cast doubt on a Hom-Ref call. However this is something we should revisit in future releases since high quality soft-clips might be derived from medium-large insertions that should not affect the confidence on contiguous non-variant sites. Unfortunately I cannot give you a time-frame for that though.

  • CarlosBorrotoCarlosBorroto Posts: 38Member

    Thanks so much for looking in to this issue. I want to share another example which makes us a little more worry.

    10 72357713 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:21,8:29:0:0,0,334

    In this case there is evidence this could be a Het. I looked into using '-bamout' to see how is HC assembling this regions and I attach an screenshot. Top original alignment, bottom '-bamout'.

    pl_issue.bamout.png
    1920 x 1200 - 181K
  • evolvedmicrobeevolvedmicrobe MGHPosts: 14Member

    @valentin, thanks for the response! I hope you won’t mind if I ask a quick follow up question, as I didn’t quite get your explanation. For the soft clipped bases that you mention, are these feeding into the Pileup used for the PL calculation, or in to the HC caller? (I gather the presence of soft clipped reads in one but not the other explained the discrepancy in the likelihoods for the heterozygote call, but wasn’t sure which one they showed up in). Thanks again!

  • valentinvalentin Cambridge, MAPosts: 19Member, GATK Dev mod

    @‌CarlosBorroto thanks for that additional example.

    The same applies here… in this case HC does not find evidence of variation on that region… this could be due to several reasons and we might need to do a more in-depth analysis of additional debug outputs to understand the exact reason.

    Then, and only when reference model confidence output is requested (GVCF or BP_RESOLUTION), HC goes into generating those non-variant sites position by position using the original alignment. At this point HC is committed to output a non-variant site, thus it won't ever do a het or hom-alt call. At the same time there is some evidence that suggest that the hom-ref is not a reliable call either thus the 0,0,XXX likelihood and GQ == 0.

    In practice the only difference between this case and the previous one is the nature of the evidence casting doubt on the hom-ref call. In the former it seems pretty much due to surrounding soft-clips, in the latter there are mismatching base calls.

  • valentinvalentin Cambridge, MAPosts: 19Member, GATK Dev mod
    edited April 2014

    @evolvedmicrobe actually the soft-clips are fed into both processes (1) the variant discovery phase using local de-novo assembly and (2) non-variant sites ref-confidence output generation using the original pileup (with high quality soft-clips unclipped).

    In that case (1) does not find evidence of a valid concrete variant on that region and so it does not add any corresponding variant site call to the output it need to fill the 'ALT' column with something concrete. However (2) is not constraint to make up its mind about a particular alternative allele, it just needs to know that there might be something out there, whatever that is (i.e. <NON_REF>), and that is what happens in this case.

    Even if these two aspects may seem contradictory, actually they are both compatible.

    I hope this makes it more clear.

    Post edited by valentin on
  • bredesonbredeson Posts: 33Member
    edited April 1

    Hey GATK Team,

    FYI: I was struggling with this same issue, but found the fix (for my specific case?). I'm posting this fix for posterity.

    I am using GATK v3.3-0 with --emitRefConfidence BP_RESOLUTION and I had good evidence that a particular variant existed and was real. My analysis focused on an individual hybrid of species A and B. From species B I had an individual that was homozygous for the (species B-differentiating) alternate allele and the raw aligned reads for the hybrid suggested it was heterozygous, as expected.

    When attempting to call the specific variant locus of interest, the HaplotypeCaller would call neighboring variants within about 50 bp of it, but would not call my target variant correctly. Looking at the --bamOutput from HC I could see that the HC had thrown all the variant reads out at that site and had only assembled the reference-like haplotypes at my target locus. The variant-allele bearing reads were not duplicates, secondary, or supplemental alignments, and had higher MapQ and BaseQ than my min thresholds. They variant alleles were also linked on the same reads/haplotype as successfully called neighboring het variants.

    After some fiddling with parameters over a test target region, it looks like the narrow range of --kmerSize (default k = 10 and 25) is responsible. Increasing the --kmerSize (or range of permitted sizes: 10, 25, 35 for example) seems to do the trick and the variant is correctly called. This would imply that the local context was non-unique at these k-mer sizes and the variant was pruned out of the read-threading assembly graph, and were therefore never assembled into haplotypes for the variant to be called.

    Post edited by bredeson on

    Issue · Github
    by Geraldine_VdAuwera

    Issue Number
    13
    State
    open
    Last Updated
    Assignee
    Array
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,759Administrator, GATK Dev admin

    Thanks for reporting this, @bredeson. I'd like to put together a collection of helpful reports like this to illustrate the kind of advanced usage tweaks that people find useful to improve sensitivity in specific cases. I'm not sure yet when it will happen or what it'll look like exactly, but my sense is that there's potential for something like that to be a helpful resource if it's collated as opposed to being spread across many isolated discussions. Would you mind if I flag this for inclusion when we get around to that?

    Geraldine Van der Auwera, PhD

  • bredesonbredeson Posts: 33Member

    Hey @Geraldine_VdAuwera,

    I have no issue with you including this :-)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,759Administrator, GATK Dev admin

    @bredeson Thanks! The devs were actually a little surprised that you found tweaking the kmers helped in this case. Would you be willing to share a snippet of data for them to test locally?

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.