HaplotypeCaller --dbsnp

blueskypy Member Posts: 261 ✭✭
edited June 2013 in Ask the GATK team

The doc says "dbSNP is not used in any way for the calculations themselves. --dbsnp binds reference ordered data". Does that mean that deciding whether a locus is a variant is not influenced by whether that variant is present in dbSNP? What does "--dbsnp binds reference ordered data" mean?

Also, why isn't there a --indel option?


  • blueskypy Member Posts: 261 ✭✭

    Thanks Geraldine for the explanation! But intuitively, wouldn't confirmation of a variant call by dbSNP increase the confidence in that call? If so, why not use dbSNP to help make the decision on that call?

    Also, do you mean I could add the following to HaplotypeCaller?

    --dbsnp Mills_and_1000G_gold_standard.indels.b37.vcf --dbsnp 1000G_phase1.indels.b37.vcf

  • blueskypy Member Posts: 261 ✭✭

    Thanks so much, Geraldine! Have a great weekend!

  • everestial007 Greensboro Member Posts: 65

    Is it not possible to give the --dbsnp argument twice?
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R lyrata_genome.fa -I realigned_readsMA605.bam --dbsnp filtered_indelsMA605.vcf --dbsnp filtered_snpsMA605.vcf --genotyping_mode DISCOVERY -stand_emit_conf 30 -stand_call_conf 30 -o raw02_variantsMA605.vcf

    I get an error; part of the output is:

    ERROR MESSAGE: Argument 'dbsnp' has too many values: [org.broadinstitute.gatk.utils.commandline.ArgumentMatchStringValue@412ff43d, org.broadinstitute.gatk.utils.commandline.ArgumentMatchStringValue@334bf23a].

    But when I provide the --dbsnp argument only once (either --dbsnp filtered_indelsMA605.vcf or --dbsnp filtered_snpsMA605.vcf), it runs.
    I know it is important to use the -L flag with BaseRecalibrator. But how important is it to provide the -L flag when using HaplotypeCaller (during BQSR bootstrapping)?

    Thanks in advance !

  • Sheila Broad Institute Member, Broadie, Moderator, Dev Posts: 4,287 admin


    You cannot use the --dbsnp argument more than once, as you have discovered. However, you can combine your two dbSNP files into a single VCF using CombineVariants and pass that file to --dbsnp.

    Have a look at this article for more information on using -L:


  • everestial007 Greensboro Member Posts: 65

    Thank you Sheila !

  • namsyvo University of Memphis Member Posts: 5

    Hi, I have a question about how to interpret information in the dbSNP file that is passed to HaplotypeCaller. Let's say I have this line in a VCF file:


    20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

    So how is the value 0.5 in AF=0.5 (allele frequency) calculated and what does it mean? Can you give me a specific example so that I can understand it fully and clearly? Thank you.

  • Sheila Broad Institute Member, Broadie, Moderator, Dev Posts: 4,287 admin


    The AF field gives you the allele frequency of the alternate alleles. In your example above, there is one alternate allele (A). The AF = 0.5 means that the A alternate allele appears at a frequency of 50% in the genotypes. Notice your 3 samples have these genotypes: G/G, G/A, A/A. That is 3 A alleles out of 6 called alleles, so the A allele has a frequency of 50%.
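
A minimal sketch of that arithmetic in Python, assuming AF is taken as the ALT share of called alleles in the GT fields. The helper name `alt_allele_frequency` is made up for illustration and is not part of GATK:

```python
import re

def alt_allele_frequency(genotypes, alt_index=1):
    """Fraction of called alleles equal to alt_index across all samples.

    Hypothetical helper for illustration only; not a GATK function.
    """
    alt_count = 0
    called = 0
    for gt in genotypes:
        # Split on either separator: '|' (phased) or '/' (unphased).
        for allele in re.split(r"[|/]", gt):
            if allele == ".":
                continue  # missing allele call, not counted
            called += 1
            if int(allele) == alt_index:
                alt_count += 1
    if called == 0:
        raise ValueError("no called alleles")
    return alt_count / called

# GT values of the three samples in the example record above
genotypes = ["0|0", "1|0", "1/1"]
af = alt_allele_frequency(genotypes)
print(af)  # 0.5 (3 ALT alleles out of 6 called alleles)
```

Phasing does not matter for this count: `1|0` and `1/0` both contribute one ALT allele.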

    I hope this helps!


  • namsyvo University of Memphis Member Posts: 5

    Thank you @Sheila for your quick answer. I saw this statement in the VCF v4.2 documentation:
    "AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes".
    Could you explain what it means? I'm a little confused about this given your explanation above.

    One more question: is there any difference between the genotypes 1|0 and 0|1? Sometimes I see 1|0, sometimes 0|1. For example, for the second sample in my previous example, can I write the genotype as 0|1 instead of 1|0?

    Thank you.

  • Geraldine_VdAuwera Administrator, Dev Posts: 10,704 admin

    Regarding the definition of AF, ours may not entirely match the definition provided by the VCF spec. I think the spec recommends using AF to express allele fraction in the read data, whereas we use it to express the frequency in called genotypes. This may be a violation of the intent of the spec, if you take a strict reading of it.

    In your second question, it comes down to the different ways of writing heterozygous genotypes that have been phased. The order of the 0 and the 1 (which represent specific alleles) signifies how those alleles are phased, either within a pedigree or relative to co-location on physical haplotypes. You can't switch the notation without affecting the meaning that this carries.
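
As a rough illustration of that point, here is a small Python sketch that treats allele order as meaningful only when the `|` (phased) separator is used. The function names are hypothetical, and comparing a phased genotype against an unphased one by ignoring order is an assumption of this sketch, not something GATK defines:

```python
def parse_gt(gt):
    """Return (alleles, phased) for a VCF GT string such as '1|0' or '0/1'."""
    phased = "|" in gt
    sep = "|" if phased else "/"
    return tuple(gt.split(sep)), phased

def same_genotype(gt_a, gt_b):
    """Compare two GT strings; allele order counts only if both are phased."""
    alleles_a, phased_a = parse_gt(gt_a)
    alleles_b, phased_b = parse_gt(gt_b)
    if phased_a and phased_b:
        # Order encodes which haplotype carries which allele.
        return alleles_a == alleles_b
    # Unphased: order carries no information (assumption for mixed cases too).
    return sorted(alleles_a) == sorted(alleles_b)

print(same_genotype("1|0", "0|1"))  # False: different phase
print(same_genotype("1/0", "0/1"))  # True: unphased, order is irrelevant
```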

    Geraldine Van der Auwera, PhD

  • Sheila Broad Institute Member, Broadie, Moderator, Dev Posts: 4,287 admin


    Regarding your phasing question, I found this article to be quite helpful.


