We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Coverage bias in HaplotypeCaller


I am doing joint variant calling for Illumina paired end data of 150 monkeys. Coverage varies from 3-30 X with most individuals having around 4X coverage.

I was doing all the variant detection and hard-filtering (GATK Best Practices) process with both UnifiedGenotyper and Haplotype caller.

My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue?

In particular, consider pairwise differences across individuals:
The absolute values are lower for low coverage individuals than for high coverage, for both methods, since it is more difficult to make calls for them.
However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals by substracting from the total reference length all the filtered sites and sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper. Values are comparable for low and high coverage individuals (If both pairs consist of members of similar populations).

However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population.

Can you confirm this behaviour? For population genetic analysis this effect is highly problematic. I would trade in more false positive if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity) which is the average cross individuals diversity and thus already at the higher-end for within individual heterozygosity. I would expect the problem to be even worse if I chose lower values.

Can you give me any recommendation, should I go back using UnifiedGenotyper or is there any way to solve this problem?

Many thanks in advance,


  • feilchenfeldtfeilchenfeldt ViennaMember

    Ok, I think the answer is that I would have to specifiy the option --input_prior.

    However, it is not clear to me which prior to use. I have individuals from 5 populations and the reference allele is not always ancestral (but in 80% of the cases) so the site frequency spectrum is expected to be complicated.

    Is a flat prior the best thing to use here? It seems quite extreme to me, since overall I still expect most variants to be rare. Or a neutral prior where 20% of the site frequencies are mirrored to reflect that the reference allele is not always ancestral?

    Thanks a lot,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Hannes,

    It is known that HaplotypeCaller's sensitivity decreases at very low coverage values; 4x per individual is going to be problematic. But the bias for reference at low depth happens at the assembly step, so the solution I would recommend, rather than fiddling with priors (which come in later in the process) is to relax the HC's assembly parameters. Off the top of my head I'd say you need to look at -minPruning (which is the number of reads required to not prune a node from the assembly graph) first. There may be a couple of others that I'm not thinking of right now -- @Sheila may be able to help you figure those out. And you'll want to look at the HC's tool doc.

  • feilchenfeldtfeilchenfeldt ViennaMember

    Hi Geraldine,

    Thanks a lot for the answer. In can confirm that a flat prior does not solve the problem.
    I am trying out -minPruning 1 now.
    I would be very grateful if @Sheila or anyonw else had any further ideas for how to avoid reference bias in low coverage individuals.

    For instance, I would be happy if sites with too little information for making a call would be set missing instead of calling reference.
    Otherwise I will just move back to UnifiedGenotyper which does not show any bias once one controls for missing values.

    Thanks so much for your time,

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    Hi Hannes,

    GATK is tested on 30x coverage data, so it performs best on 30x coverage data. However, you can try setting -minPruning 1 and -minDanglingBranchLength 1. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--minPruning
    I have found these two parameters to work well at making calls in low coverage regions. You can always look at the PLs to see how confident we are that the genotype is correct. The QUAL score will tell you how confident we are that a variant actually exists at the position.

    Good luck!


  • MehulSMehulS Member

    Hi @Sheila I was wondering if there's any update on this ? Are these parameters still recommended (particularly while joint genotyping) or has this bias been resolved ?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin


    We have had quite a few version improvements since that post in 2015. The details of which can be found here: https://github.com/broadinstitute/gatk/releases

    Which recommended parameter are you referring to specifically? Which version of gatk are you using? If you could give us more details about what you are trying to achieve with your data and what the issues you are facing with respect to that, I could provide you with more details.

  • MehulSMehulS Member
    edited January 2019

    Thank you. Bhanu I looked inside the forums and found out that GATK4 has had significant improvements wrt Haplotype Caller Reference bias in low coverage samples.

    I am using v4.0.12 and plan to call variants in targeted intervals using the joint calling best practices workflow on 5-10X coverage samples but was concerned with reports of reference bias in HC.

    The parameters I was referring to were those recommended by Sheila (-minPruning1 and -minDanglingBranchLength 1).

Sign In or Register to comment.