The current GATK version is 3.6-0

HaplotypeCaller --dbsnp

blueskypy, Posts: 261, Member ✭✭
edited June 2013 in Ask the GATK team

The doc says "dbSNP is not used in any way for the calculations themselves. --dbsnp binds reference ordered data". Does this mean that the decision on whether a locus is a variant is not influenced by whether that variant is present in dbSNP? And what does "--dbsnp binds reference ordered data" mean?

Also why isn't there a --indel option?


Best Answers


  • blueskypy, Posts: 261, Member ✭✭

    Thanks Geraldine for the explanation! But intuitively, wouldn't confirming a variant call against dbSNP increase the confidence in that call? If so, why not use dbSNP to help make the decision on that call?

    Also, do you mean I could add the following to HaplotypeCaller?

    --dbsnp Mills_and_1000G_gold_standard.indels.b37.vcf --dbsnp 1000G_phase1.indels.b37.vcf

  • blueskypy, Posts: 261, Member ✭✭

    Thanks so much, Geraldine! Have a great weekend!

  • everestial007 (Greensboro), Posts: 62, Member

    Is it not possible to give the --dbsnp argument twice?
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R lyrata_genome.fa -I realigned_readsMA605.bam --dbsnp filtered_indelsMA605.vcf --dbsnp filtered_snpsMA605.vcf --genotyping_mode DISCOVERY -stand_emit_conf 30 -stand_call_conf 30 -o raw02_variantsMA605.vcf

    I am getting an error; part of the output is:

    ERROR MESSAGE: Argument 'dbsnp' has too many values: [org.broadinstitute.gatk.utils.commandline.ArgumentMatchStringValue@412ff43d, org.broadinstitute.gatk.utils.commandline.ArgumentMatchStringValue@334bf23a].

    But when I provide the --dbsnp argument only once (either --dbsnp filtered_indelsMA605.vcf or --dbsnp filtered_snpsMA605.vcf), it runs.
    It is important to use the -L flag with BaseRecalibrator. But how important is it to provide the -L flag when using HaplotypeCaller (while processing -BQSR bootstrapping)?

    Thanks in advance !

  • Sheila (Broad Institute), Posts: 4,095, Member, Broadie, Moderator, Dev admin


    You cannot use the -dbsnp argument more than once, as you have discovered. However, you can combine your two DBSNP files using CombineVariants.

    Have a look at this article for more information on using -L:


  • everestial007 (Greensboro), Posts: 62, Member

    Thank you Sheila !
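
The workaround described above (merge the two known-sites files with CombineVariants, then pass the merged file through a single --dbsnp) could be sketched roughly as follows. This is a minimal illustration, not an official recipe: it only builds and prints the two command lines, the merged file name `known_sitesMA605.vcf` and the helper function names are hypothetical, and CombineVariants may need additional merge options depending on how the sample names in the two inputs overlap.

```python
import shlex

JAR = "GenomeAnalysisTK.jar"  # jar name taken from the command in the thread


def combine_variants_cmd(reference, vcfs, merged_vcf):
    """Build a GATK 3 CombineVariants command that merges several VCFs."""
    cmd = ["java", "-jar", JAR, "-T", "CombineVariants", "-R", reference]
    for vcf in vcfs:
        cmd += ["--variant", vcf]  # one --variant per input file
    cmd += ["-o", merged_vcf]
    return cmd


def haplotype_caller_cmd(reference, bam, dbsnp_vcf, out_vcf):
    """Build a HaplotypeCaller command with exactly one --dbsnp argument."""
    return [
        "java", "-jar", JAR, "-T", "HaplotypeCaller",
        "-R", reference, "-I", bam,
        "--dbsnp", dbsnp_vcf,  # a single merged file instead of two --dbsnp flags
        "--genotyping_mode", "DISCOVERY",
        "-stand_emit_conf", "30", "-stand_call_conf", "30",
        "-o", out_vcf,
    ]


# File names below come from the thread, except the hypothetical merged file.
combine = combine_variants_cmd(
    "lyrata_genome.fa",
    ["filtered_indelsMA605.vcf", "filtered_snpsMA605.vcf"],
    "known_sitesMA605.vcf",
)
call = haplotype_caller_cmd(
    "lyrata_genome.fa",
    "realigned_readsMA605.bam",
    "known_sitesMA605.vcf",
    "raw02_variantsMA605.vcf",
)

for cmd in (combine, call):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed commands would be run in that order, so the merged file exists before HaplotypeCaller reads it.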

  • namsyvo (University of Memphis), Posts: 5, Member

    Hi, I have a question about how to interpret the information in the dbSNP file that is passed to HaplotypeCaller. Let's say I have this line in the VCF file:


    20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

    So how is the value 0.5 in AF=0.5 (allele frequency) calculated and what does it mean? Can you give me a specific example so that I can understand it fully and clearly? Thank you.

  • Sheila (Broad Institute), Posts: 4,095, Member, Broadie, Moderator, Dev admin


    The AF field gives you the allele frequency of each alternate allele. In your example above, there is one alternate allele (A). AF=0.5 means that the A allele appears at a frequency of 50% in the genotypes. Notice that your 3 samples have these genotypes: G/G, A|G, A/A. That makes 3 A alleles out of 6 alleles in total, so the A allele has a frequency of 50%.

    I hope this helps!
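
As a worked version of the arithmetic above, the following sketch counts alleles in the called genotypes of the example record and compares the result with the AF value stored in the INFO column. It assumes the genotype-based reading of AF used in this thread; the function names are illustrative only, and the record is split on whitespace because the line quoted in the thread is space-separated (real VCF files are tab-separated).

```python
import re

# Example record quoted in the thread.
vcf_line = (
    "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
    "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,."
)


def alt_allele_frequency(line, alt_index=1):
    """Fraction of called alleles equal to the given ALT index (1 = first ALT)."""
    fields = line.split()
    gt_pos = fields[8].split(":").index("GT")  # position of GT in FORMAT
    alleles = []
    for sample in fields[9:]:
        gt = sample.split(":")[gt_pos]
        # "|" (phased) and "/" (unphased) both separate the two alleles;
        # "." marks a missing allele and is skipped.
        alleles += [a for a in re.split(r"[|/]", gt) if a != "."]
    return alleles.count(str(alt_index)) / len(alleles)


def info_value(line, key):
    """Return the value of an INFO key such as AF, or None if absent."""
    for entry in line.split()[7].split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            return value
    return None


print(alt_allele_frequency(vcf_line))  # 3 ALT alleles out of 6
print(info_value(vcf_line, "AF"))
```

Here the genotypes 0|0, 1|0 and 1/1 contribute the alleles 0, 0, 1, 0, 1, 1, so the computed frequency matches the recorded AF=0.5.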


  • namsyvo (University of Memphis), Posts: 5, Member

    Thank you @Sheila for your quick answer. I saw this statement in the VCF v4.2 documentation:
    "AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary
    data, not called genotypes".
    Could you explain to me what it means? I'm a little bit confused about this, given your explanation above.

    One more question: is there any difference between the genotypes 1|0 and 0|1? Sometimes I see 1|0, and sometimes I see 0|1. For example, for the second sample in my previous example, can I write the genotype as 0|1 instead of 1|0?

    Thank you.

  • Geraldine_VdAuwera, Posts: 10,557, Administrator, Dev admin

    Regarding the definition of AF, ours may not entirely match the definition provided by the VCF spec. I think the spec recommends using AF to express allele fraction in the read data, whereas we use it to express the frequency in called genotypes. This may be a violation of the intent of the spec, if you take a strict reading of it.

    In your second question, it comes down to the different ways of writing heterozygous genotypes that have been phased. The order of the 0 and the 1 (which represent specific alleles) signifies how those alleles are phased, either within a pedigree or relative to co-location on physical haplotypes. You can't switch the notation without affecting the meaning that this carries.

    Geraldine Van der Auwera, PhD
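
The point about phased notation can be illustrated with a small sketch. In VCF, "|" separates the alleles of a phased genotype and "/" those of an unphased one, so allele order carries meaning only in the phased case. The helper names below are hypothetical, and treating a phased/unphased pair as comparable by allele content alone is a simplifying choice made for this example.

```python
def parse_gt(gt):
    """Split a diploid GT string into (is_phased, alleles)."""
    if "|" in gt:
        return True, tuple(gt.split("|"))
    return False, tuple(gt.split("/"))


def same_genotype(a, b):
    """Compare two GT strings, respecting allele order only when both are phased."""
    phased_a, alleles_a = parse_gt(a)
    phased_b, alleles_b = parse_gt(b)
    if phased_a and phased_b:
        # Order encodes which haplotype carries which allele.
        return alleles_a == alleles_b
    # At least one genotype is unphased: only the allele content is comparable.
    return sorted(alleles_a) == sorted(alleles_b)


print(same_genotype("1|0", "0|1"))  # phased: different haplotype assignment
print(same_genotype("1/0", "0/1"))  # unphased: same heterozygous genotype
```

Under these assumptions the first comparison is False and the second is True, which is why 1|0 cannot be rewritten as 0|1 without changing what the record says.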

  • Sheila (Broad Institute), Posts: 4,095, Member, Broadie, Moderator, Dev admin


    Regarding your phasing question, I found this article to be quite helpful.

