Problems with dbSNP file using the HaplotypeCaller

ssandmann Münster, Germany Member


I am having the following problem:
I use the HaplotypeCaller (GATK 3.3.0) for variant calling. To identify variants that are known according to dbSNP, I pass the "--dbsnp" argument with a dbSNP file (VCF). I thought everything was working fine, but by coincidence I observed what is, in my eyes, a really serious problem: the same call is recognized as a known variant in one sample, but not in another. These are the two relevant lines from the VCF files:

17 7579643 . CCCCCAGCCCTCCAGGT C 5066.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=4.819;ClippingRankSum=-1.054;DP=231;FS=78.565;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-0.994;QD=21.93;ReadPosRankSum=-5.473;SOR=1.639;set=variant;EFF=INTRON(MODIFIER||||393|TP53|protein_coding|CODING|ENST00000445888|3|1) GT:AD:DP:GQ:PL 1/1:23,207:230:99:5104,251,0

17 7579643 rs59758982 CCCCCAGCCCTCCAGGT C 2868.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=3.120;ClippingRankSum=0.256;DB;DP=134;FS=1.120;MLEAC=2;MLEAF=1.00;MQ=59.91;MQ0=0;MQRankSum=1.849;QD=21.41;ReadPosRankSum=-1.285;SOR=0.704;set=variant;EFF=INTRON(MODIFIER||||393|TP53|protein_coding|CODING|ENST00000445888|3|1) GT:AD:DP:GQ:PL 1/1:13,121:134:96:2906,96,0

As we exclude known variants from our analysis, it is essential that this step works correctly, and I am unsure what to do now. The variant seems to be well known (according to the information on the NCBI homepage). So why was it not identified in the other sample?

It would be great if anyone could help me. Many thanks in advance!



  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi Sarah,

    That is odd. I have to ask -- are you absolutely sure that the dbsnp file was provided when that sample was run? Can you check the command line recorded in the VCF header?
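One way to run this check across many files is sketched below. It assumes that GATK 3 records its invocation in header meta-lines whose key starts with `##GATKCommandLine`, and that a supplied dbSNP file shows up there as the text `dbsnp`. The function names are hypothetical helpers, not part of GATK.

```python
def gatk_command_lines(vcf_lines):
    """Collect the recorded GATK command lines from the VCF header.

    Assumption: GATK writes them as meta-lines starting with
    '##GATKCommandLine'. Reading stops at the first non-'##' line.
    """
    commands = []
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # the #CHROM line and the records follow the meta-lines
        if line.startswith("##GATKCommandLine"):
            commands.append(line.rstrip("\n"))
    return commands


def mentions_dbsnp(command_line):
    """True if the recorded command line contains the text 'dbsnp'."""
    return "dbsnp" in command_line.lower()


def dbsnp_was_recorded(vcf_lines):
    """True if any recorded GATK command line mentions dbsnp."""
    return any(mentions_dbsnp(c) for c in gatk_command_lines(vcf_lines))
```

For an uncompressed file, `dbsnp_was_recorded(open("Sample1.vcf"))` would report whether the argument appears in the header. A `True` result only shows that some dbSNP binding was recorded, so the path printed in the header should still be compared by eye between samples.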

    If yes, then this could be a bug in HC in that version. You could try re-running just that region of that sample with the latest version (3.4-46) to test whether this reproduces consistently. If it does, we'll need test files to debug. Meanwhile, a possible workaround is to explicitly re-annotate dbSNP rsIDs using VariantAnnotator as a post-processing step.
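The re-annotation workaround could be scripted roughly as follows. This is a sketch that only assembles a GATK 3-style VariantAnnotator command without running it. The file names are placeholders, and `GenomeAnalysisTK.jar` is assumed to be the jar of the installed GATK 3 release, so the arguments should be checked against the documentation of the version in use.

```python
import shlex


def variant_annotator_command(reference, input_vcf, dbsnp_vcf, output_vcf,
                              jar="GenomeAnalysisTK.jar"):
    """Build (but do not run) a VariantAnnotator call that adds dbSNP rsIDs.

    All paths are placeholders supplied by the caller.
    """
    return [
        "java", "-jar", jar,
        "-T", "VariantAnnotator",
        "-R", reference,       # reference FASTA used for calling
        "-V", input_vcf,       # VCF produced by HaplotypeCaller
        "--dbsnp", dbsnp_vcf,  # dbSNP VCF used to fill the ID column
        "-o", output_vcf,      # re-annotated output VCF
    ]


# Hypothetical file names, for illustration only.
cmd = variant_annotator_command(
    "reference.fasta", "Sample1.vcf", "dbsnp.vcf", "Sample1.reannotated.vcf"
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it. Keeping the command as a list avoids shell quoting problems with unusual file names.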

  • ssandmann Münster, Germany Member

    Dear Geraldine,

    I had a very detailed look at the data and this is what I found out:

    The dbsnp file was definitely provided when the samples were run. Actually, I have got seven samples and they are all analyzed in one big pipeline. In four out of seven cases the call gets recognized, in three cases it does not.

    I checked the dbsnp file, using grep "rs59758982" and this is the output:

    17 7579643 rs59758982 CCCCCAGCCCTCCAGGT C . . GNO;INT;OTHERKG;PM;PMC;RS=59758982;RSPOS=7579669;SAO=0;SLO;SSR=0;VC=DIV;VP=0x050128080001000102000200;WGT=1;dbSNPBuildID=129

    So the variant is clearly present in the database.
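A grep for the rsID confirms that the record exists, but not that it matches the call by position and alleles. The sketch below looks a call up by CHROM, POS, REF and ALT instead. It assumes exact string matching with no allele normalization, which is not necessarily how GATK matches records internally, and the function names are hypothetical. It also holds the whole index in memory, so a full dbSNP file would first need to be restricted to the regions of interest.

```python
def parse_site(vcf_line):
    """Return (chrom, pos, id, ref, [alts]) from the first five VCF columns."""
    chrom, pos, rsid, ref, alt = vcf_line.split(None, 5)[:5]
    return chrom, int(pos), rsid, ref, alt.split(",")


def build_dbsnp_index(dbsnp_lines):
    """Map (chrom, pos, ref, alt) to the list of rsIDs reported for it."""
    index = {}
    for line in dbsnp_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, rsid, ref, alts = parse_site(line)
        for alt in alts:
            index.setdefault((chrom, pos, ref, alt), []).append(rsid)
    return index


def lookup_rsids(call_line, index):
    """Return the rsIDs whose site and alleles exactly match the call."""
    chrom, pos, _, ref, alts = parse_site(call_line)
    found = []
    for alt in alts:
        found.extend(index.get((chrom, pos, ref, alt), []))
    return found


# The dbSNP record and the unannotated call from this thread (columns truncated).
dbsnp_record = "17 7579643 rs59758982 CCCCCAGCCCTCCAGGT C . . RS=59758982"
unannotated_call = "17 7579643 . CCCCCAGCCCTCCAGGT C 5066.73 PASS"
print(lookup_rsids(unannotated_call, build_dbsnp_index([dbsnp_record])))
# prints ['rs59758982']
```

Since the position and alleles of the unannotated call match the dbSNP record exactly, a lookup like this suggests the missing rsID is not caused by a difference in how the deletion is represented.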

    I used the VariantAnnotator (version 3.3-0) to re-annotate my vcf files. Yet, nothing changed. The call was not recognized in three out of seven cases.

    Subsequently, I installed the latest GATK version (3.4-46). I re-started our pipeline with the new version. The deletion was called in the case of all samples. Yet, the HaplotypeCaller did once again only recognize the mutation in four out of seven cases. The three cases in which it was not recognized are exactly the same as before.

    Again, I used the VariantAnnotator (version 3.4-46) and this time, the variant gets recognized in the remaining three cases as well.
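To find every site affected by this kind of inconsistency, and not only the one noticed by coincidence, the per-sample VCFs can be compared directly. The following is a minimal sketch, assuming one single-sample VCF per sample and exact matching on CHROM, POS, REF and ALT. The function name and the sample labels are hypothetical.

```python
from collections import defaultdict


def inconsistent_annotations(samples):
    """Report sites that carry an rsID in some samples but '.' in others.

    samples: dict mapping a sample name to an iterable of VCF lines.
    Returns a dict mapping (chrom, pos, ref, alt) to {sample: ID column}.
    """
    ids_by_site = defaultdict(dict)
    for name, lines in samples.items():
        for line in lines:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom, pos, rsid, ref, alt = line.split(None, 5)[:5]
            ids_by_site[(chrom, int(pos), ref, alt)][name] = rsid

    report = {}
    for site, ids in ids_by_site.items():
        values = list(ids.values())
        if "." in values and any(v != "." for v in values):
            report[site] = ids
    return report
```

Called as `inconsistent_annotations({"Sample1": open("Sample1.vcf"), "Sample2": open("Sample2.vcf")})`, it would list the TP53 deletion above together with any other sites that are annotated in only some of the samples.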


  • Sheila Broad Institute Member, Broadie, Moderator admin

    Hi Sarah,

    Hmm. I am happy to hear it is annotated with Variant Annotator in the latest version. However, I am still confused why it is not getting annotated with Haplotype Caller. Can you confirm that you used the exact same commands for all 7 samples (except for the input bam)? If so, can you submit a bug report? Instructions are here:


  • ssandmann Münster, Germany Member

    Dear Sheila,

    I had another look at the command lines and they are exactly the same (except for the bam files and the time). I submitted a bug report. You can find the archive under "dbSNP-Problem_Sandmann.tar.gz". Everything you need to reproduce the error should be in there. The variant (plus an additional one) is not recognized in Sample1, but it is in Sample2.

    We usually work with a dbSNP file only containing polymorphisms. Yet, the error may also be observed if the normal dbSNP file is used (I also checked that).

    Just tell me in case you need any additional files.

    Many thanks in advance for your help!

