
Phasing via HaplotypeCaller vs. ReadBackedPhasing

brcopeland, New York, New York, USA (Member)

I recently started running ReadBackedPhasing to correctly deduce which variants are MNPs, which led me to look at the phasing information it reports. I know a GATK FAQ notes that ReadBackedPhasing is arguably unnecessary for phasing (as HC already does it), but I compared the positions in a WES sample using both sets of annotations and found the seemingly low rate of concordance between the two odd.

Specifically, I found 19,559 variants phased with at least one other variant as indicated by HaplotypeCaller, whereas I found 12,115 as indicated by ReadBackedPhasing. I noted that the pairwise distances between phased variants were generally quite a bit larger in ReadBackedPhasing, but I think that can largely be explained by HaplotypeCaller using a smaller window size (by default, which I have not adjusted). The oddity to me is that I determined only 1,658 sites to be concordant, i.e. both tools indicated they were phased with at least one other variant. I would have expected ReadBackedPhasing's phased variant sites to essentially be a superset of those produced by HaplotypeCaller, but this certainly does not appear to be the case. Is this expected behavior? Is my interpretation incorrect in some manner? Thanks for any insight; I'm trying to determine which phasing info we want to retain for downstream interpretation.
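For readers who want to reproduce the distance comparison mentioned above, here is a minimal sketch. It assumes the phase blocks have already been extracted into a mapping from block ID to positions on a single chromosome, and it measures gaps between adjacent phased variants within a block, which may differ from the exact pairwise measure used in the post. The function name `adjacent_distances` and the input layout are hypothetical.

```python
from statistics import median


def adjacent_distances(blocks):
    """Return gaps (in bp) between neighbouring phased sites within each block.

    `blocks` is a hypothetical mapping: phase-block id -> iterable of
    positions, all on the same chromosome.
    """
    distances = []
    for positions in blocks.values():
        ordered = sorted(set(positions))
        # Pair each position with its successor inside the same block.
        distances.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return distances


def summarize(distances):
    """Return (count, median, max) for a non-empty list of distances."""
    return len(distances), median(distances), max(distances)
```

Running `summarize` on the distances from each tool gives a quick side-by-side view of how far apart the phased variants tend to be.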

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @brcopeland
    Hi,

    Can you tell us how you are determining the concordance rate?

    Thanks
    Sheila

  • brcopeland, New York, New York, USA (Member)

    Hi Sheila, thanks for your response. I just mean it at the most basic level, i.e. whether a site was phased by one tool or by both. For each tool, I retained the positions of all sites whose phasing block annotation was shared by >= 2 sites, and this is the source of the numbers I quote.

    Brett
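The counting procedure described in this reply could be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual script: it assumes a single-sample VCF, that HaplotypeCaller stores its phase-block ID in the `PID` FORMAT field, and that ReadBackedPhasing stores haplotype IDs in the `HP` FORMAT field as values like `100-1,100-2` (so the block ID is the part before the final dash). Check the header of your own VCF before relying on these tags. The function names are hypothetical.

```python
from collections import defaultdict


def phased_sites(vcf_lines, tag, min_block_size=2):
    """Return the set of (chrom, pos) whose phase block has >= min_block_size sites.

    Assumes a single-sample VCF (FORMAT in column 9, sample in column 10).
    `tag` is the FORMAT key holding the phase-block annotation: assumed to be
    "PID" for HaplotypeCaller and "HP" for ReadBackedPhasing.
    """
    blocks = defaultdict(set)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue
        chrom, pos = fields[0], int(fields[1])
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        raw = sample.get(tag, ".")
        if raw in (".", ""):
            continue
        if tag == "HP":
            # Assumed layout "BLOCK-1,BLOCK-2": keep the part before the last dash.
            block_id = raw.split(",")[0].rsplit("-", 1)[0]
        else:
            block_id = raw
        blocks[(chrom, block_id)].add((chrom, pos))
    return {
        site
        for sites in blocks.values()
        if len(sites) >= min_block_size
        for site in sites
    }


def concordance(hc_lines, rbp_lines):
    """Return (HC-phased, RBP-phased, phased-by-both) site sets."""
    hc = phased_sites(hc_lines, "PID")
    rbp = phased_sites(rbp_lines, "HP")
    return hc, rbp, hc & rbp
```

Passing open file handles for the two VCFs to `concordance` yields three sets whose sizes correspond to the three kinds of counts quoted in the thread (HC-phased, RBP-phased, and concordant). Note that this compares sites by position only; it does not check whether the two tools agree on which alleles sit on the same haplotype.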

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hmm, we've never done side-by-side comparisons so I'm not sure I can shed much light on this. I think you're right about your observation re: pairwise distance and the relation to their respective window sizes. I agree the concordance rate seems very low; it would be interesting to see how much is explained by representation problems, and the limitation of RBP being only able to phase biallelic sites.

  • brcopeland, New York, New York, USA (Member)

    Geraldine, a few points for clarification:
    1. I observed that HC does phasing at homozygous sites; when I excluded homozygotes, the number of positions phased by HC dropped from 19,559 to 4,541, while the number concordant predictably remained the same (RBP only uses heterozygotes). It still seems odd to me that only 1,658/4,541 HC sites were also reported by RBP.
    2. I'm not sure what you mean by representation problems, but it is true these were very basic tests conducted (on one VCF).
    3. These are all biallelic sites as this was for just one sample.

    I know you said in my other thread RBP is set to be deprecated so I imagine we probably won't be able to figure this out, but thought there was a (slight) possibility those with more experience with this tool might have some insight.

    Thanks again for your help,
    Brett
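The homozygote exclusion described in point 1 could look something like the sketch below. It assumes a single-sample VCF and that the HaplotypeCaller phase annotation lives in the `PID` FORMAT field; both the tag name default and the function names are illustrative assumptions rather than the poster's code.

```python
import re


def is_het(gt):
    """True for a diploid genotype with two distinct, non-missing alleles."""
    alleles = re.split(r"[/|]", gt)
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def het_phased_positions(vcf_lines, tag="PID"):
    """Return [(chrom, pos)] for heterozygous sites carrying a phase annotation.

    Assumes a single-sample VCF (FORMAT in column 9, sample in column 10).
    """
    kept = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        if sample.get(tag, ".") in (".", ""):
            continue  # no phase annotation at this site
        if is_het(sample.get("GT", ".")):
            kept.append((fields[0], int(fields[1])))
    return kept
```

For scale, the figures quoted in this post work out to roughly 36.5% of HC's heterozygous phased sites (1,658 of 4,541) also being phased by RBP.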

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    @brcopeland Sorry for the late response, we were at an institute-wide retreat.

    1. HC has access to more information than RBP so I'm not surprised that it would manage to phase more sites -- though the delta is admittedly bigger than I would have expected.
    2. By representation problems I mean that sometimes events that are close by may be represented in different ways. Since RBP goes back to the original read mappings, it's not working from exactly the same data as HC, which internally realigns reads (sometimes dramatically, especially in regions with lots of repeats). Therefore I would expect some loss in concordance from that.

    Would be great to get more insight into this and I definitely encourage anyone with more experience of RBP to jump into the conversation!
