Phasing via HaplotypeCaller vs. ReadBackedPhasing
I recently started running ReadBackedPhasing to correctly identify which variants are MNPs, which led me to look at how the phasing information is represented. I know a GATK FAQ notes that ReadBackedPhasing is arguably unnecessary for obtaining phasing (since HC already does it), but I compared the positions in a WES sample annotated by both tools and was surprised by the seemingly low concordance between the two.
Specifically, HaplotypeCaller indicated 19,559 variants phased with at least one other variant, whereas ReadBackedPhasing indicated 12,115. The pairwise distances between phased variants tended to be considerably larger with ReadBackedPhasing, but I think that can largely be explained by HaplotypeCaller using a smaller window size (the default, which I have not adjusted). What strikes me as odd is that only 1,658 sites were concordant, i.e. sites that both tools indicated were phased with at least one other variant. I would have expected ReadBackedPhasing's phased sites to be essentially a superset of those produced by HaplotypeCaller, but this certainly does not appear to be the case. Is this expected behavior? Is my interpretation incorrect in some way? Thanks for any insight; I'm trying to determine which phasing information we want to retain for downstream interpretation.
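
For anyone wanting to reproduce this kind of comparison, here is a minimal Python sketch of one way the site-level concordance could be computed (not necessarily how the counts above were produced). It assumes HaplotypeCaller records phasing in the PID FORMAT field and ReadBackedPhasing records it in the HP FORMAT field, which is my understanding of GATK 3.x output; older ReadBackedPhasing versions may instead mark phasing with `|` in GT, in which case the tag logic would need to change. The function name `phased_sites`, the helper `rec`, and the toy records are hypothetical and only for illustration.

```python
from collections import defaultdict

def phased_sites(vcf_lines, tag, group_id, sample_index=0):
    """Return {(chrom, pos)} for sites sharing a phase group with >= 1 other site.

    tag      -- FORMAT key carrying phasing (assumed: "PID" for HC, "HP" for RBP)
    group_id -- function mapping the tag's value to a phase-group identifier
    """
    groups = defaultdict(set)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= 9 + sample_index:
            continue  # no genotype column for the requested sample
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        if tag not in keys:
            continue
        idx = keys.index(tag)
        if idx >= len(values) or values[idx] in ("", "."):
            continue  # tag declared in FORMAT but missing for this sample
        site = (fields[0], int(fields[1]))
        groups[(fields[0], group_id(values[idx]))].add(site)
    # Keep only sites whose group contains at least two variants.
    return {s for members in groups.values() if len(members) >= 2 for s in members}

def hc_group(pid):
    # Assumption: PID looks like "100_A_T" and is shared by all variants in a group.
    return pid

def rbp_group(hp):
    # Assumption: HP looks like "100-1,100-2"; the part before "-" names the group.
    return hp.split(",")[0].rsplit("-", 1)[0]

def rec(pos, fmt, sample):
    """Build a toy single-sample VCF record on chr1 (illustration only)."""
    return "\t".join(["chr1", str(pos), ".", "A", "T", "50", ".", ".", fmt, sample])

hc_lines = [
    rec(100, "GT:PGT:PID", "0/1:0|1:100_A_T"),
    rec(105, "GT:PGT:PID", "0/1:0|1:100_A_T"),
    rec(500, "GT", "0/1"),
    rec(700, "GT:PGT:PID", "0/1:0|1:700_A_T"),
    rec(710, "GT:PGT:PID", "0/1:1|0:700_A_T"),
    rec(1200, "GT:PGT:PID", "0/1:0|1:1200_A_T"),  # singleton group, not counted
]
rbp_lines = [
    rec(100, "GT:HP", "0/1:100-1,100-2"),
    rec(105, "GT:HP", "0/1:100-1,100-2"),
    rec(500, "GT:HP", "0/1:500-1,500-2"),
    rec(900, "GT:HP", "0/1:500-2,500-1"),
]

hc = phased_sites(hc_lines, "PID", hc_group)
rbp = phased_sites(rbp_lines, "HP", rbp_group)
summary = {"hc_only": len(hc - rbp), "rbp_only": len(rbp - hc), "both": len(hc & rbp)}
print(summary)
```

With real data, the toy lists would be replaced by open file handles for the two VCFs (for example `gzip.open(path, "rt")` for compressed files). Requiring at least two sites per group is meant to match the "phased with at least one other variant" definition used above; counting every site that merely carries the tag could give different numbers.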