HaplotypeCaller missing a clear variant

fa8@sanger.ac.ukfa8@sanger.ac.uk Sanger InstituteMember
edited June 2016 in Ask the GATK team

Dear colleagues,

I am comparing two callers and found some cases in which HC behaves strangely.
Let's look into one of the cases. There's a site in which there are 15 bases overlapping. Of these, 6 are like the reference (C) and 9 are variants (T). Please, look at the attached IGV's snapshot. The reads are of good quality; same for the bases. And there are no indels nor mismatches around. How is it possible that HC genotypes this site as 0/0?

20 60216651 . C . . . AN=2;DP=14 GT:AD:DP:RGQ 0/0:6:14:0

These are the running parameters:
-T HaplotypeCaller --genotyping_mode DISCOVERY -ERC BP_RESOLUTION -R human.fasta --dbsnp dbsnp_138.vcf.gz --ploidy 2 -o output.gvcf.gz -I input.bam

-T GenotypeGVCFs -R human.fasta --includeNonVariantSites --dbsnp dbsnp_138.vcf.gz --variant output.gvcf.gz -o output.genotypes.gvcf.gz

I am using GATK version v3.4-0-g7e26428



  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Hi Federico,

    As we are now on version 3.6, enough things have changed in the HaplotypeCaller (and bugs fixed) that it's not worth our time to investigate calls made (or not made) with older versions. If you try with 3.6 and find that it's still not being called, we'll take a look at it.
  • fa8@sanger.ac.ukfa8@sanger.ac.uk Sanger InstituteMember
    edited June 2016

    Hi Geraldine,

    I should have tested with the latest version first, sorry. I've done so and also asked HaplotypeCaller to save the realigned bam (-bamout). That bam shows the variant is still there, but HaplotypeCaller does not call it.
    With the new version, now it does not call this site as "0/0" but as "./.", which is much better:

    20 60216651 . C 25.24 LowQual DP=14;MLEAC=0;MLEAF=NaN GT:AD:DP ./.:6,8:14

    I will move to this new version. Anyway, I would love to understand why the site has not been called. Mapping and base qualities are good.
    In the attached snapshot I've coloured the reads by sample, pink corresponding to the HC assemblies.

    Many thanks,

  • fa8@sanger.ac.ukfa8@sanger.ac.uk Sanger InstituteMember

    A zoom out of the region... I have short reads - working with ancient DNA

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    Hi Federico,

    The issue is that HaplotypeCaller is not confident whether the call is hom-ref or het. Note the GQ of 0, which is why in 3.6, the tool assigns a no-call to the site. I have a few things for you to try, but first, can you try adding -allowNonUniqueKmersInRef to your command?


  • fa8@sanger.ac.ukfa8@sanger.ac.uk Sanger InstituteMember

    Hi Sheila,

    Many thanks for the feedback. The new version's result ("./.") is not bad. What I didn't understand is why the older version genotyped this site as "0/0". Anyway, we will probably move to UnifiedGenotyper - I know you are no longer supporting it (or working on it) but in my understanding HaplotypeCaller has been designed for higher coverage than we have and not for ancient DNA, for which reads can be very short (30 bps) and contain damaged DNA. We have the intuition that the assembly of HC may not work well in this case. I should run a test and compare UG and HC though (I can share the results back if this is of interest for you).


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Good luck, Federico -- if you do run tests, please share the results as this may be helpful for others.
  • fa8@sanger.ac.ukfa8@sanger.ac.uk Sanger InstituteMember

    In case it is of interest to others, here is a very brief summary of the comparison of HC and UG for an ancient DNA sample (18x depth of coverage).

    It seems HC performs better than UG.

    I genotyped chr20 with HC and UG and filtered out variants with Qual<50. The set of HGDP (Human Genome Diversity Project) variants was used as a truth set. It is expected that variants not matching already known variants would be enriched in false positives.

    UG genotypes a bit more variants than HC, but the specificity (HGDP matching percentage) is higher for HC (84.8% vs 86.4%). There is a considerable proportion of variants that are discordant (different genotype in UG and HC): 7973 out of 111368 for UG, 3304 out of 105333 for HC. Discordant variants show lower overlap with HGDP variants (52% UG, 55% HC) and have a much higher proportion of ancient DNA damage substitutions: G-->A and C-->T. This suggests that they are enriched in false positives, as could be expected.

    These are the count for discordant genotypes, showing UG:HC. Null means the site was not genotyped. The most frequent discordance is when UG says het and HC says hom-ref, followed by cases in which one of the callers says het and the other caller makes no call. There is also a considerable number of cases where UG says hom-alt and HC says hom-ref.

     18 0/0:1/1
    136 NULL:1/1
    159 1/1:NULL
    300 0/0:0/1
    386 1/1:0/1
    869 0/1:1/1

    1095 1/1:0/0
    1510 0/1:NULL
    1595 NULL:0/1
    3954 0/1:0/0

    My summary: The overlap of HC with HGDP variants is higher, and HC results in a smaller proportion of discordant variants. That makes me think that HC is doing better than UG. This conclusion may not hold for other ancient samples with different depth of coverage.


Sign In or Register to comment.