Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

HaplotypeCaller '--emitRefConfidence' gives same PL(0) to two genotypes

CarlosBorrotoCarlosBorroto Posts: 33Member
edited March 25 in Ask the GATK team

Hi,

In our downstream analysis we have to differentiate between non-variants because of no coverage and confident reference calls. Because of this, we need to pay close attention to HaplotypeCaller non-variant calls. Doing this we found calls like this one:

17 7127955 . G <NON_REF> . . END=7127955 GT:DP:GQ:MIN_DP:PL 0/0:29:0:29:0,0,468

Would it be possible to get a feeling why this initial call is giving the same PL(0) to 0/0 and 0/1. If you look at the alignment,after "Data Pre-processing" best practices, we can see there is strong evidence this is a 0/0 position.

GenotypeGVCFs won't call this position as variant, which is good.

Thanks, Carlos

PS: I tried to add a BAM file with the relevant region and I got "Uploaded file type is not allowed".

Post edited by CarlosBorroto on
Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    Hi Carlos,

    Can you post a screenshot of the site in IGV, and a few more comments about why you believe the call is incorrect?

    To upload data, you can use our FTP server (see FAQs for address and login info) -- but please don't upload anything until we ask for it specifically. If the devs decide they need to look at the BAM file to give their opinion, I will let you know. Thanks for understanding!

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    Hi @Geraldine_VdAuwera‌ ,

    Thanks for looking into this. This is the view in IGV.

    Best,

    Carlos

    gatk_3.1_emitrefconfidence_PLs.png
    2034 x 1288 - 145K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    OK, that's pretty clean. Can you tell me which version this is from? I assume 3.0 or 3.1?

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    This is 3.1.

    BTW, we do not think the call is wrong. We think this is a 0/0 as GATK calls it even before GenotypeGVCFs. We are curious of why the PL looks like '0,0,468'. Why HaplotypeCaller is as confident this is a 0/0 as 0/1?.

    --Carlos

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    Yes, I understand, and I agree those PLs are surprising. Have you looked at the annotations on the haplotypes that the HC outputs using -bamOut for this region?

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    Hi Carlos, could you please upload a snippet of the bam file with the problem region to our FTP server, so we can look at this more closely? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    Hi @Geraldine_VdAuwera‌ ,

    I'm very grateful you are taking the time to look into this. I have uploaded the relevant information to the ftp server. Please look for file '12702.tar.bz2'. I think I included all the information you need to reproduce the issue, but please let me know if you need anything else.

    I haven't had the time to play with '-bamOut'. It is something I definitely want to try.

    --Carlos

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    Hi Carlos,

    Thanks for the data, this all looks good. I've put it in the bug tracker; the devs are pretty busy right now but they should be able to look at your bug report within a few days. I'll let you know in this thread when we have a better idea of what's going on.

    Geraldine Van der Auwera, PhD

  • evolvedmicrobeevolvedmicrobe MGHPosts: 14Member

    @Geraldine, one cause of this issue seems to be that if the GenotypeLikelihood is higher for the Heterozygote call, the Haplotype caller by default lowers the value to be equal to the homozygous reference likelihood, and so the PL values can wind up being equal for the homozygous reference and the heterozygote, leading to the puzzling 0,0,XXX values for the PL field. The bit of code that seems responsible for this is inside the RefVsAnyResult.java file, specifically the capByHomRefLikelihood method. Do you know what the reason for capping the values is? It seems like if the likelihood model is correct and the heterozygous state is the most probable genotype, there wouldn't be a need to lower the value to match the HomRefLikelihood.

  • valentinvalentin Posts: 16Member, GATK Developer mod

    @evolvedmicrobe, well spotted. This is a current limitation of the HaplotypeCaller GVCF output.

    At the point of the process where that adjustment is done, the HC has already committed to the output of a non-variant site after the local-assembly process has not found enough evidence for any particular alternative allele there. Thus to be consistent with that non-variant call the Hom-Ref genotype is constrained to be at least as like as any other genotype thus the resulting 0,0,XXX likelihoods.

    In most cases this should be due to the presence of surrounding high quality soft-clipped bases that cast doubt on a Hom-Ref call. However this is something we should revisit in future releases since high quality soft-clips might be derived from medium-large insertions that should not affect the confidence on contiguous non-variant sites. Unfortunately I cannot give you a time-frame for that though.

  • CarlosBorrotoCarlosBorroto Posts: 33Member

    Thanks so much for looking in to this issue. I want to share another example which makes us a little more worry.

    10 72357713 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:21,8:29:0:0,0,334

    In this case there is evidence this could be a Het. I looked into using '-bamout' to see how is HC assembling this regions and I attach an screenshot. Top original alignment, bottom '-bamout'.

    pl_issue.bamout.png
    1920 x 1200 - 181K
  • evolvedmicrobeevolvedmicrobe MGHPosts: 14Member

    @valentin, thanks for the response! I hope you won’t mind if I ask a quick follow up question, as I didn’t quite get your explanation. For the soft clipped bases that you mention, are these feeding into the Pileup used for the PL calculation, or in to the HC caller? (I gather the presence of soft clipped reads in one but not the other explained the discrepancy in the likelihoods for the heterozygote call, but wasn’t sure which one they showed up in). Thanks again!

  • valentinvalentin Posts: 16Member, GATK Developer mod

    @‌CarlosBorroto thanks for that additional example.

    The same applies here… in this case HC does not find evidence of variation on that region… this could be due to several reasons and we might need to do a more in-depth analysis of additional debug outputs to understand the exact reason.

    Then, and only when reference model confidence output is requested (GVCF or BP_RESOLUTION), HC goes into generating those non-variant sites position by position using the original alignment. At this point HC is committed to output a non-variant site, thus it won't ever do a het or hom-alt call. At the same time there is some evidence that suggest that the hom-ref is not a reliable call either thus the 0,0,XXX likelihood and GQ == 0.

    In practice the only difference between this case and the previous one is the nature of the evidence casting doubt on the hom-ref call. In the former it seems pretty much due to surrounding soft-clips, in the latter there are mismatching base calls.

  • valentinvalentin Posts: 16Member, GATK Developer mod
    edited April 17

    @evolvedmicrobe actually the soft-clips are fed into both processes (1) the variant discovery phase using local de-novo assembly and (2) non-variant sites ref-confidence output generation using the original pileup (with high quality soft-clips unclipped).

    In that case (1) does not find evidence of a valid concrete variant on that region and so it does not add any corresponding variant site call to the output it need to fill the 'ALT' column with something concrete. However (2) is not constraint to make up its mind about a particular alternative allele, it just needs to know that there might be something out there, whatever that is (i.e. <NON_REF>), and that is what happens in this case.

    Even if these two aspects may seem contradictory, actually they are both compatible.

    I hope this makes it more clear.

    Post edited by valentin on
Sign In or Register to comment.