To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Coverage bias in HaplotypeCaller

Hi,

I am doing joint variant calling for Illumina paired end data of 150 monkeys. Coverage varies from 3-30 X with most individuals having around 4X coverage.

I was doing all the variant detection and hard-filtering (GATK Best Practices) process with both UnifiedGenotyper and Haplotype caller.

My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue?

In particular, consider pairwise differences across individuals:
The absolute values are lower for low coverage individuals than for high coverage, for both methods, since it is more difficult to make calls for them.
However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals by substracting from the total reference length all the filtered sites and sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper. Values are comparable for low and high coverage individuals (If both pairs consist of members of similar populations).

However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population.

Can you confirm this behaviour? For population genetic analysis this effect is highly problematic. I would trade in more false positive if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity) which is the average cross individuals diversity and thus already at the higher-end for within individual heterozygosity. I would expect the problem to be even worse if I chose lower values.

Can you give me any recommendation, should I go back using UnifiedGenotyper or is there any way to solve this problem?

Many thanks in advance,
Hannes

Answers

  • feilchenfeldtfeilchenfeldt ViennaMember

    Ok, I think the answer is that I would have to specifiy the option --input_prior.

    However, it is not clear to me which prior to use. I have individuals from 5 populations and the reference allele is not always ancestral (but in 80% of the cases) so the site frequency spectrum is expected to be complicated.

    Is a flat prior the best thing to use here? It seems quite extreme to me, since overall I still expect most variants to be rare. Or a neutral prior where 20% of the site frequencies are mirrored to reflect that the reference allele is not always ancestral?

    Thanks a lot,
    Hannes

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Hannes,

    It is known that HaplotypeCaller's sensitivity decreases at very low coverage values; 4x per individual is going to be problematic. But the bias for reference at low depth happens at the assembly step, so the solution I would recommend, rather than fiddling with priors (which come in later in the process) is to relax the HC's assembly parameters. Off the top of my head I'd say you need to look at -minPruning (which is the number of reads required to not prune a node from the assembly graph) first. There may be a couple of others that I'm not thinking of right now -- @Sheila may be able to help you figure those out. And you'll want to look at the HC's tool doc.

  • feilchenfeldtfeilchenfeldt ViennaMember

    Hi Geraldine,

    Thanks a lot for the answer. In can confirm that a flat prior does not solve the problem.
    I am trying out -minPruning 1 now.
    I would be very grateful if @Sheila or anyonw else had any further ideas for how to avoid reference bias in low coverage individuals.

    For instance, I would be happy if sites with too little information for making a call would be set missing instead of calling reference.
    Otherwise I will just move back to UnifiedGenotyper which does not show any bias once one controls for missing values.

    Thanks so much for your time,
    Hannes

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @feilchenfeldt
    Hi Hannes,

    GATK is tested on 30x coverage data, so it performs best on 30x coverage data. However, you can try setting -minPruning 1 and -minDanglingBranchLength 1. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--minPruning
    https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--minDanglingBranchLength
    I have found these two parameters to work well at making calls in low coverage regions. You can always look at the PLs to see how confident we are that the genotype is correct. The QUAL score will tell you how confident we are that a variant actually exists at the position.

    Good luck!

    -Sheila

Sign In or Register to comment.