
HaplotypeCaller --dbsnp

blueskypyblueskypy Member ✭✭
edited June 2013 in Ask the GATK team

The doc says "dbSNP is not used in any way for the calculations themselves. --dbsnp binds reference ordered data". Does this mean that whether a locus is called as a variant is not influenced by whether that variant is present in dbSNP? And what does "--dbsnp binds reference ordered data" mean?

Also why isn't there a --indel option?


Best Answers


  • blueskypyblueskypy Member ✭✭

    Thanks, Geraldine, for the explanation! But intuitively, wouldn't confirmation of a variant call by dbSNP increase the confidence in that call? If so, why wouldn't we use dbSNP to help make the decision on that call?

    Also, do you mean I could add the following to HaplotypeCaller?

    --dbsnp Mills_and_1000G_gold_standard.indels.b37.vcf --dbsnp 1000G_phase1.indels.b37.vcf

  • blueskypyblueskypy Member ✭✭

    Thanks so much, Geraldine! Have a great weekend!

  • everestial007everestial007 GreensboroMember ✭✭

    Is it not possible to give the --dbsnp argument twice?
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R lyrata_genome.fa -I realigned_readsMA605.bam --dbsnp filtered_indelsMA605.vcf --dbsnp filtered_snpsMA605.vcf --genotyping_mode DISCOVERY -stand_emit_conf 30 -stand_call_conf 30 -o raw02_variantsMA605.vcf

    I am receiving an error; part of the output is:

    ERROR MESSAGE: Argument 'dbsnp' has too many values: [org[email protected]412ff43d, org[email protected]334bf23a].

    But when I provide the --dbsnp argument only once (either --dbsnp filtered_indelsMA605.vcf or --dbsnp filtered_snpsMA605.vcf), it runs.
    It is important to use the -L flag with BaseRecalibrator. But how important is it to provide the -L flag when using HaplotypeCaller (while bootstrapping for BQSR)?

    Thanks in advance !

  • SheilaSheila Broad InstituteMember, Broadie admin


    You cannot use the --dbsnp argument more than once, as you have discovered. However, you can combine your two dbSNP files into one using CombineVariants and pass the combined file instead. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php

    Have a look at this article for more information on using -L: http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals
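
As a rough sketch of the CombineVariants suggestion above, the snippet below only assembles a GATK3-style command line and prints it; it does not run GATK. The helper name `combine_variants_command` and the output file name are hypothetical, and the flags follow the usage pattern in the linked CombineVariants documentation. When both inputs contain the same sample, a genotype merge option may also be required, so check the tool docs for your GATK version.

```python
import shlex


def combine_variants_command(reference, vcfs, output, jar="GenomeAnalysisTK.jar"):
    """Build (but do not execute) a GATK3-style CombineVariants command."""
    if len(vcfs) < 2:
        raise ValueError("need at least two VCF files to combine")
    cmd = ["java", "-jar", jar, "-T", "CombineVariants", "-R", reference]
    for vcf in vcfs:
        # One --variant argument per input file.
        cmd += ["--variant", vcf]
    cmd += ["-o", output]
    return cmd


cmd = combine_variants_command(
    "lyrata_genome.fa",
    ["filtered_snpsMA605.vcf", "filtered_indelsMA605.vcf"],
    "known_sitesMA605.vcf",  # hypothetical output name
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The combined file can then be passed to HaplotypeCaller with a single --dbsnp argument.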


  • everestial007everestial007 GreensboroMember ✭✭

    Thank you Sheila !

  • namsyvonamsyvo University of MemphisMember

    Hi, I have a question about how to interpret information in the dbSNP file that is passed to HaplotypeCaller. Let's say I have this line in a VCF file:


    20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

    So how is the value 0.5 in AF=0.5 (allele frequency) calculated and what does it mean? Can you give me a specific example so that I can understand it fully and clearly? Thank you.

  • SheilaSheila Broad InstituteMember, Broadie admin


    The AF field gives you the allele frequency of each alternate allele. In your example above, there is one alternate allele (A). AF=0.5 means that the A allele appears at a frequency of 50% in the genotypes. Notice that your 3 samples have the genotypes G|G, A|G and A/A, so A accounts for 3 of the 6 called alleles, i.e. a frequency of 50%.

    I hope this helps!
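
The arithmetic above can be checked with a small sketch that counts alleles directly from the sample columns of the example record. It follows the genotype-based reading of AF described in this thread; the function name `alt_allele_frequency` is hypothetical, and it assumes GT is the first colon-separated field of each sample column, as in the example line.

```python
import re


def alt_allele_frequency(sample_fields, alt_index=1):
    """Fraction of called alleles that match the given ALT allele index."""
    alleles = []
    for field in sample_fields:
        gt = field.split(":")[0]  # GT is the first sub-field in the example
        # Split on either the phased (|) or unphased (/) separator.
        alleles.extend(a for a in re.split(r"[|/]", gt) if a != ".")
    if not alleles:
        raise ValueError("no called alleles")
    return alleles.count(str(alt_index)) / len(alleles)


samples = ["0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,."]
af = alt_allele_frequency(samples)
print(af)  # 3 ALT alleles out of 6 called alleles
```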


  • namsyvonamsyvo University of MemphisMember

    Thank you @Sheila for your quick answer. I saw this statement in the VCF v4.2 documentation:
    "AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary
    data, not called genotypes".
    Could you explain what it means? I'm a little confused about this, given your explanation above.

    One more question: is there any difference between the genotypes 1|0 and 0|1? Sometimes I see 1|0 and sometimes 0|1. For example, for the second sample in my previous example, can I write the genotype as 0|1 instead of 1|0?

    Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Regarding the definition of AF, ours may not entirely match the definition provided by the VCF spec. I think the spec recommends using AF to express allele fraction in the read data, whereas we use it to express the frequency in called genotypes. This may be a violation of the intent of the spec, if you take a strict reading of it.

    In your second question, it comes down to the different ways of writing heterozygous genotypes that have been phased. The order of the 0 and the 1 (which represent specific alleles) signifies how those alleles are phased, either within a pedigree or relative to co-location on physical haplotypes. You can't switch the notation without affecting the meaning that this carries.
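
To illustrate the point about allele order, here is a minimal sketch that compares two GT strings: order is significant only when both are phased (written with `|`), and ignored otherwise. The helper names are hypothetical, and treating a phased/unphased pair as unordered is an assumption of this example rather than anything the VCF spec or GATK prescribes.

```python
def parse_genotype(gt):
    """Return (is_phased, alleles) for a GT string such as '1|0' or '0/1'."""
    phased = "|" in gt
    alleles = tuple(gt.replace("|", "/").split("/"))
    return phased, alleles


def same_genotype(a, b):
    """Compare two GT strings, respecting allele order only if both are phased."""
    phased_a, alleles_a = parse_genotype(a)
    phased_b, alleles_b = parse_genotype(b)
    if phased_a and phased_b:
        # Phased: the position of each allele identifies its haplotype.
        return alleles_a == alleles_b
    # Unphased (or mixed): only the set of alleles carried matters here.
    return sorted(alleles_a) == sorted(alleles_b)


print(same_genotype("0/1", "1/0"))
print(same_genotype("0|1", "1|0"))
```

With unphased notation the two heterozygous spellings compare as equal, while the phased spellings 0|1 and 1|0 do not.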

  • SheilaSheila Broad InstituteMember, Broadie admin


    Regarding your phasing question, I found this article to be quite helpful.


  • @Sheila Sorry to hijack this post. I realize HaplotypeCallerSpark doesn't support the --dbsnp argument. Is there a workaround for doing the dbSNP part separately after running HaplotypeCallerSpark without the --dbsnp param? Thank you

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @johnsmith0

    Please try to start a new thread whenever possible, especially in this case, as it is difficult for us to follow what has happened before and determine the exact issue you are facing. I urge you to start a new thread for this issue.

    Bhanu Gandham
