
HaplotypeCaller --dbsnp


We are using the --dbsnp option in HaplotypeCaller. In several cases, an indel that matches a dbSNP entry is not annotated with its rsID. For example:

Variant call:
3 33188384 . TA T 455.73 ....

dbsnp entry:
3 33188384 rs57153345 TA T . . OTHERKG;RS=57153345;RSPOS=33188385;SAO=0;SSR=0;U3;VC=DIV;VP=0x050000800001000002000200;WGT=1;dbSNPBuildID=129

Stepping through the code, the variant call alleles do not appear to be normalized, which causes the simple reference-field comparison to fail.
alleles=[TAA*, TA]
alleles=[TA*, T]

I believe the unnormalized variant representation is due to other candidate indels at the locus that wound up being filtered upstream.
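
To make the mismatch concrete, here is a minimal Python sketch. It assumes that the first `alleles=` line above is the caller's internal representation, that the second is the dbSNP record, and that both start at position 33188384. `trim_shared_suffix` is a hypothetical helper for illustration, not a GATK function.

```python
def trim_shared_suffix(ref, alt):
    """Drop trailing bases shared by REF and ALT, keeping at least one base in each."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return ref, alt


# (chrom, pos, ref, alt); the position of the internal call is an assumption.
call = ("3", 33188384, "TAA", "TA")
dbsnp = ("3", 33188384, "TA", "T")  # rs57153345

# A plain field-by-field comparison fails on the untrimmed alleles.
print(call == dbsnp)  # False

# After trimming the shared trailing "A", the two records agree.
chrom, pos, ref, alt = call
print((chrom, pos, *trim_shared_suffix(ref, alt)) == dbsnp)  # True
```

Trimming only the shared suffix is enough for this particular record; it is not a full normalization, which would also left-align against the reference.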

Can you confirm that this is a bug, or am I missing something?

If this is a bug, is there a workaround?

Thank you.


  • tommycarstensen (United Kingdom; Member)

    I am curious to see what answer you will receive to your question. I don't care about dbSNP annotations myself, so I never noticed this issue. One workaround is to split your multiallelics into biallelics, normalize (trim and left-align) the indels, and then annotate with dbSNP. Annotations might not be that important, but I could see this being a real problem when defining truth sets for VQSR, though that's a different story. Not sure if you are interested, but Heng Li had an interesting entry on his blog regarding multiallelic sites in VCFs:

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Unfortunately the dbsnp annotation is neither entirely reliable nor consistent across annotating tools (e.g. some have slightly different rules, looking at either just the site or also the alleles to determine matches). The reason little effort has been put into this in GATK is that, for calling purposes, the dbsnp annotation is only used as a smell test for the number of novel vs. known variants, and as long as we were dealing with reasonably sized cohorts this worked well enough as an estimator. Now, with very large cohorts like ExAC, we're running into more multiallelic sites, which complicates matters. But so far we've considered this a downstream problem that we don't have the ability to focus on. I'm not sure that will change in the near future, so I would look to other sources for tips and workarounds. Heng Li's post is certainly worth a good read.
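
The two matching rules mentioned in this reply (site only versus site plus alleles) can be sketched in Python. The function names and the tuple layout are illustrative assumptions; they do not describe how GATK or any particular annotator is implemented.

```python
# Toy dbSNP table: (chrom, pos, rsid, ref, alt). The single entry is the
# record quoted in the original question.
DBSNP = [("3", 33188384, "rs57153345", "TA", "T")]


def match_by_site(chrom, pos, records):
    """Site-only rule: any dbSNP record at the same chrom/pos counts as a hit."""
    return [rsid for c, p, rsid, _, _ in records if (c, p) == (chrom, pos)]


def match_by_alleles(chrom, pos, ref, alt, records):
    """Allele-aware rule: chrom, pos, REF and ALT must all agree."""
    return [
        rsid
        for c, p, rsid, r, a in records
        if (c, p, r, a) == (chrom, pos, ref, alt)
    ]


# The untrimmed representation TAA>TA hits under the site-only rule
# but misses under the allele-aware rule.
print(match_by_site("3", 33188384, DBSNP))
print(match_by_alleles("3", 33188384, "TAA", "TA", DBSNP))
```

Under the site-only rule the untrimmed call is reported as known; under the allele-aware rule it is reported as novel, which is one way two tools can disagree on the same VCF.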

  • lmose (Member)

    Thanks for the feedback. Just to clarify, this is not a multi-allelic site. I believe there were multiple alleles considered upstream, but only one was called.

  • Sheila (Broad Institute; Member, Broadie)


    I just want to make sure I understand your issue. Are you saying the dbsnp ID column is not populated in the final VCF, and that this is keeping other annotations from appearing? Are you using HaplotypeCaller in GVCF mode? If so, have you tried using the dbsnp argument with GenotypeGVCFs?


  • lmose (Member)

    No, the only issue here is that the dbsnp ID column is not populated. No, I'm not using GVCF mode. The issue in this particular case is that HC's internal representation of the indel is not yet normalized when compared to dbSNP. We typically use another tool for annotation downstream anyway, so we'll just use that going forward. Thanks for following up.
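
The workaround suggested earlier in the thread (split multiallelics into biallelics, trim and left-align, then match against dbSNP) can be sketched as follows. This is a toy illustration under stated assumptions: the contig sequence, the `rs_hypothetical` ID and all function names are made up, and in practice a dedicated tool such as `bcftools norm` or `vt normalize` would do this step.

```python
def normalize(pos, ref, alt, contig_seq):
    """Trim and left-align one biallelic variant (pos is 1-based)."""
    alleles = [ref, alt]
    # Right-trim shared trailing bases. When trimming would empty an allele,
    # first extend both alleles by one reference base to the left.
    while alleles[0] != alleles[1] and alleles[0][-1] == alleles[1][-1]:
        if min(len(a) for a in alleles) == 1:
            if pos == 1:
                break  # already at the start of the contig
            pos -= 1
            alleles = [contig_seq[pos - 1] + a for a in alleles]
        alleles = [a[:-1] for a in alleles]
    # Left-trim shared leading bases, keeping at least one base per allele.
    while min(len(a) for a in alleles) > 1 and alleles[0][0] == alleles[1][0]:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles[0], alleles[1]


def annotate(chrom, pos, ref, alts, contig_seq, dbsnp_index):
    """Split into biallelic records, normalize each, then look up an ID."""
    out = []
    for alt in alts:
        p, r, a = normalize(pos, ref, alt, contig_seq)
        out.append((chrom, p, dbsnp_index.get((chrom, p, r, a), "."), r, a))
    return out


# Hypothetical toy contig with a CA repeat, and a one-entry dbSNP index
# keyed by the normalized (chrom, pos, ref, alt).
CONTIG = "GGCACACAT"
INDEX = {("toy", 2, "GCA", "G"): "rs_hypothetical"}

# A multiallelic call written at the right end of the repeat.
for record in annotate("toy", 6, "ACA", ["A", "ACACA"], CONTIG, INDEX):
    print(record)
```

Keying the lookup on the normalized alleles means the match no longer depends on how the caller happened to write the indel.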
