Problems with dbSNP file using the HaplotypeCaller

ssandmann (Münster, Germany; Member)

Hi,

I am having the following problem:
I use the HaplotypeCaller (GATK 3.3.0) for variant calling. To identify variants that are known according to dbSNP, I use the "--dbsnp" argument and provide a dbSNP file (VCF file). I thought that everything was working fine, but by coincidence I observed a (in my eyes really serious) problem: the same call is recognized in one sample, but not in another. These are the two relevant lines from the VCF files:

17 7579643 . CCCCCAGCCCTCCAGGT C 5066.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=4.819;ClippingRankSum=-1.054;DP=231;FS=78.565;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-0.994;QD=21.93;ReadPosRankSum=-5.473;SOR=1.639;set=variant;EFF=INTRON(MODIFIER||||393|TP53|protein_coding|CODING|ENST00000445888|3|1) GT:AD:DP:GQ:PL 1/1:23,207:230:99:5104,251,0

17 7579643 rs59758982 CCCCCAGCCCTCCAGGT C 2868.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=3.120;ClippingRankSum=0.256;DB;DP=134;FS=1.120;MLEAC=2;MLEAF=1.00;MQ=59.91;MQ0=0;MQRankSum=1.849;QD=21.41;ReadPosRankSum=-1.285;SOR=0.704;set=variant;EFF=INTRON(MODIFIER||||393|TP53|protein_coding|CODING|ENST00000445888|3|1) GT:AD:DP:GQ:PL 1/1:13,121:134:96:2906,96,0

As we exclude known variants from our analysis, it is essential that this step works correctly. I am quite unsure what to do now. The variant seems to be well known (according to information on the NCBI homepage). So why was it not identified in the other sample?

It would be great if anyone could help me. Many thanks in advance!

Sarah

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi Sarah,

    That is odd. I have to ask -- are you absolutely sure that the dbsnp file was provided when that sample was run? Can you check the command line recorded in the VCF header?

    If yes, then this could be a bug in HC in that version. You could try re-running just that region of that sample with the latest version (3.4-46) to test whether this reproduces consistently. If it does, we'll need test files to debug. Meanwhile, a possible workaround is to explicitly re-annotate dbSNP rsIDs using VariantAnnotator as a post-processing step.
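
The header check suggested above could be scripted roughly as follows. This is only a sketch: it assumes the VCF header records the invocation in a line starting with `##GATKCommandLine`, and that a bound dbSNP file appears as `dbsnp=(RodBinding ... source=...)` with `UNBOUND` as the source when no file was given. The helper names are hypothetical and not part of GATK.

```python
import re


def read_header(path):
    """Collect the header lines (those starting with '#') of an uncompressed VCF."""
    header = []
    with open(path) as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # first record reached, header is complete
            header.append(line.rstrip("\n"))
    return header


def gatk_command_lines(header_lines):
    """Return header lines that record a GATK command line (assumed prefix)."""
    return [ln.rstrip("\n") for ln in header_lines
            if ln.startswith("##GATKCommandLine")]


def dbsnp_argument(command_line):
    """Extract the value recorded for the dbsnp argument, or None if absent."""
    match = re.search(r"\bdbsnp=(\([^)]*\)|\S+)", command_line)
    return match.group(1) if match else None


def dbsnp_was_bound(command_line):
    """True if a dbsnp value is recorded and is not marked UNBOUND (assumption)."""
    value = dbsnp_argument(command_line)
    return value is not None and "UNBOUND" not in value
```

Running `dbsnp_was_bound` on the command line of each sample's VCF would show quickly whether any of the runs was started without the dbSNP file.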

  • ssandmann (Münster, Germany; Member)

    Dear Geraldine,

    I had a very detailed look at the data and this is what I found out:

    The dbsnp file was definitely provided when the samples were run. Actually, I have got seven samples and they are all analyzed in one big pipeline. In four out of seven cases the call gets recognized, in three cases it does not.

    I checked the dbsnp file, using grep "rs59758982" and this is the output:

    17 7579643 rs59758982 CCCCCAGCCCTCCAGGT C . . GNO;INT;OTHERKG;PM;PMC;RS=59758982;RSPOS=7579669;SAO=0;SLO;SSR=0;VC=DIV;VP=0x050128080001000102000200;WGT=1;dbSNPBuildID=129

    So obviously the variant is present in the database.

    I used the VariantAnnotator (version 3.3-0) to re-annotate my vcf files. Yet, nothing changed. The call was not recognized in three out of seven cases.

    Subsequently, I installed the latest GATK version (3.4-46) and re-ran our pipeline with it. The deletion was called in all samples. Yet, the HaplotypeCaller once again recognized the variant in only four out of seven cases. The three cases in which it was not recognized are exactly the same as before.

    Again, I used the VariantAnnotator (version 3.4-46) and this time, the variant gets recognized in the remaining three cases as well.

    Sarah
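
The manual grep comparison described above can be generalized to list every call that has no ID although dbSNP contains a record with the same position and alleles. The following is a minimal sketch, not GATK's own matching logic: it only does an exact match on CHROM, POS, REF and ALT, ignores multi-allelic records, and uses hypothetical function names.

```python
def parse_site(vcf_line):
    """Split a VCF record into its site key (CHROM, POS, REF, ALT) and its ID."""
    chrom, pos, vid, ref, alt = vcf_line.split()[:5]
    return (chrom, int(pos), ref, alt), vid


def missing_rsids(call_lines, dbsnp_lines):
    """List calls whose ID is '.' although dbSNP has an exactly matching record."""
    known = {}
    for line in dbsnp_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        key, rsid = parse_site(line)
        known.setdefault(key, rsid)  # keep the first rsID seen for a site

    missing = []
    for line in call_lines:
        if line.startswith("#") or not line.strip():
            continue
        key, vid = parse_site(line)
        if vid == "." and key in known:
            missing.append((key, known[key]))
    return missing
```

Applied to the two calls quoted in the first post and the dbSNP record found with grep, this would report the first call as missing rs59758982 and leave the second one alone.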

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @ssandmann
    Hi Sarah,

    Hmm. I am happy to hear it is annotated by VariantAnnotator in the latest version. However, I am still confused about why it is not getting annotated by HaplotypeCaller. Can you confirm that you used the exact same commands for all 7 samples (except for the input bam)? If so, can you submit a bug report? Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    -Sheila

  • ssandmann (Münster, Germany; Member)

    Dear Sheila,

    I had a look (again) at the command line and it is exactly the same (except for the bam files and the time). I submitted a bug report. You will find the archive under "dbSNP-Problem_Sandmann.tar.gz". Everything you need to reproduce the error should be in there. The variant (plus an additional one) is not recognized in Sample1, but it is recognized in Sample2.

    We usually work with a dbSNP file that contains only polymorphisms. Yet, the error can also be observed when the normal dbSNP file is used (I checked that as well).

    Just let me know if you need any additional files.

    Many thanks in advance for your help!

    Sarah
