Can I retrieve the haplotypes assembled from HaplotypeCaller?

I am using HaplotypeCaller on a sample that I expect to be entirely homozygous. I therefore expect only one haplotype per site; any others may come from reads derived from paralogs or repeats that are not present in the reference genome.

When I run HaplotypeCaller, it would be useful if I could:
(1) Retrieve the haplotypes created. It is likely that only the one most similar to the reference is the appropriate haplotype for the region.
(2) Retrieve the reads from the alternate haplotypes.
(3) Exclude reads that are likely derived from those haplotypes.
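
To make the three steps above concrete, here is a minimal sketch in Python. It assumes the assembled haplotypes and the reads are available as SAM text (for example a BAM written by HaplotypeCaller's `-bamout` option and converted to SAM), that each record carries a haplotype tag, and that the artificial haplotype records can be recognised by their read group. The tag name `HC`, the read-group prefix `ArtificialHaplotype`, and the function names are assumptions for illustration; check them against your own output before relying on this.

```python
from collections import defaultdict

# Assumed names; verify against the header and records of your own file.
HAPLOTYPE_TAG = "HC"
ARTIFICIAL_RG_PREFIX = "ArtificialHaplotype"


def parse_tags(fields):
    """Return the optional SAM tags (columns 12 onward) as {tag: value}."""
    tags = {}
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def split_by_haplotype(sam_lines):
    """Separate artificial haplotype records from real reads.

    Returns (haplotypes, reads_by_hap), where haplotypes maps a haplotype
    tag value to its sequence and reads_by_hap maps a haplotype tag value
    to the names of the reads assigned to it.
    """
    haplotypes = {}
    reads_by_hap = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        tags = parse_tags(fields)
        hap_id = tags.get(HAPLOTYPE_TAG)
        if hap_id is None:
            continue  # record is not assigned to any haplotype
        if tags.get("RG", "").startswith(ARTIFICIAL_RG_PREFIX):
            haplotypes[hap_id] = fields[9]  # step (1): the haplotype sequence
        else:
            reads_by_hap[hap_id].append(fields[0])  # step (2): reads per haplotype
    return haplotypes, dict(reads_by_hap)


def reads_to_exclude(reads_by_hap, keep_hap_id):
    """Step (3): names of reads assigned to any haplotype except keep_hap_id."""
    return sorted(
        name
        for hap_id, names in reads_by_hap.items()
        if hap_id != keep_hap_id
        for name in names
    )
```

Deciding which haplotype to keep (for instance the one most similar to the reference) is left to the caller. The resulting list of read names could then be used to filter the original BAM with whatever tool you prefer.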

What would happen if I used a homozygous sample and set "--maxNumHaplotypesInPopulation" to 1?
How does it choose the one haplotype, and what happens to reads that don't match that haplotype?


Best Answers


  • rwness (Member)

    Thanks a lot Sheila.
    I often find that large regions of the genome I am working on appear heterozygous when I know they are in fact homozygous. Visual inspection reveals that there are often two distinct haplotypes, each supported by multiple SNPs in perfect LD, with one more divergent from the reference. I think this pattern comes from paralogous regions.

    Using this setting may allow me and others to eliminate these problematic reads when we know a region is homozygous (or haploid, or hemizygous) AND we only use one sample at a time.

    If anyone else has ever used this approach on the Y chromosome or something similar I would be interested to hear about it.


    • Rob
  • rwness (Member)

    Thanks Geraldine, I will check it out and try it on some of the problem regions. I am actually working on a haploid organism with a relatively diverse genome (30x more diverse than human). If the new version of HC allows me to ignore or exclude reads that appear to be derived from a paralogous region, it will be very useful. I call a lot of new mutations, and by far the biggest source of false variant calls is mapping errors from paralogs.
    I would think that some people might like to be able to output the most likely haplotypes for each active region; in my case these would be the paralogs themselves.

    Thanks again
