

HaplotypeCaller --dbsnp

blueskypyblueskypy Member
edited June 2013 in Ask the GATK team

The doc says "dbSNP is not used in any way for the calculations themselves. --dbsnp binds reference ordered data". Does this mean that whether a locus is called as a variant is not influenced by whether that variant is present in dbSNP? And what does "--dbsnp binds reference ordered data" mean?

Also, why isn't there an --indel option?

Answers

  • Thanks Geraldine for the explanation! But intuitively, wouldn't finding a called variant in dbSNP increase the confidence in that call? If so, why not use dbSNP to help make the decision on the call?

    Also, do you mean I could add the following to HaplotypeCaller?

    --dbsnp Mills_and_1000G_gold_standard.indels.b37.vcf --dbsnp 1000G_phase1.indels.b37.vcf

  • Thanks so much, Geraldine! Have a great weekend!

  • everestial007everestial007 GreensboroMember

    Is it not possible to give the --dbsnp argument twice?
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R lyrata_genome.fa -I realigned_readsMA605.bam --dbsnp filtered_indelsMA605.vcf --dbsnp filtered_snpsMA605.vcf --genotyping_mode DISCOVERY -stand_emit_conf 30 -stand_call_conf 30 -o raw02_variantsMA605.vcf

    I get an error; part of the output is:

    ERROR MESSAGE: Argument 'dbsnp' has too many values: [org.broadinstitute.gatk.utils.commandline.ArgumentMatchStringValue@412ff43d, org.broadinstitute.gatk.utils.commandline.ArgumentMatchStringValue@334bf23a].

    But when I provide the --dbsnp argument only once (either --dbsnp filtered_indelsMA605.vcf or --dbsnp filtered_snpsMA605.vcf), it runs.
    It is important to use the -L flag with BaseRecalibrator. But how important is it to provide the -L flag when using HaplotypeCaller (as part of BQSR bootstrapping)?

    Thanks in advance !

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @everestial007
    Hi,

    You cannot use the --dbsnp argument more than once, as you have discovered. However, you can combine your two dbSNP files into one using CombineVariants: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php

    Have a look at this article for more information on using -L: http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals

    -Sheila
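
As a rough sketch of the workaround described above, the Python snippet below assembles a GATK 3 CombineVariants command line that merges the two VCFs from the question into one file, which could then be passed to a single --dbsnp argument. The helper name `build_combine_command` and the output file name `combined_knownMA605.vcf` are made up for illustration. The `-T`, `-R`, `--variant` and `-o` arguments follow the CombineVariants documentation linked above, but check that page for your GATK version, since further options (for example a genotype merge option when both files share sample names) may be needed.

```python
import shlex


def build_combine_command(reference, vcfs, output, jar="GenomeAnalysisTK.jar"):
    """Return an argv list for a GATK 3 CombineVariants run (hypothetical helper)."""
    if len(vcfs) < 2:
        raise ValueError("need at least two VCFs to combine")
    cmd = ["java", "-jar", jar, "-T", "CombineVariants", "-R", reference]
    for vcf in vcfs:
        # CombineVariants, unlike --dbsnp, accepts --variant more than once.
        cmd += ["--variant", vcf]
    cmd += ["-o", output]
    return cmd


combine_cmd = build_combine_command(
    "lyrata_genome.fa",
    ["filtered_snpsMA605.vcf", "filtered_indelsMA605.vcf"],
    "combined_knownMA605.vcf",  # illustrative output name, not from the thread
)
# Print a copy-pasteable command; run it by hand or via subprocess.run(combine_cmd).
print(" ".join(shlex.quote(arg) for arg in combine_cmd))
```

The combined file would then take the place of the two original arguments in the HaplotypeCaller command, i.e. a single --dbsnp combined_knownMA605.vcf.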

  • everestial007everestial007 GreensboroMember

    Thank you Sheila !

  • namsyvonamsyvo University of MemphisMember

    Hi, I have a question about how to interpret information in the dbSNP file that is passed to HaplotypeCaller. Let's say I have this line in a VCF file:

    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
    20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

    So how is the value 0.5 in AF=0.5 (allele frequency) calculated and what does it mean? Can you give me a specific example so that I can understand it fully and clearly? Thank you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @namsyvo
    Hi,

    The AF field gives the allele frequency of each alternate allele. In your example above, there is one alternate allele (A). AF=0.5 means that the A allele appears at a frequency of 50% in the genotypes. Notice that your 3 samples have these genotypes: G|G, A|G, A/A. Of the 6 called alleles, 3 are A, so the A allele has a frequency of 50%.

    I hope this helps!

    -Sheila
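
To make the arithmetic in this answer concrete, here is a minimal Python sketch that counts ALT alleles in the three GT values from the example record. The function name `alt_allele_frequency` is invented for illustration and this is not how GATK itself computes the annotation; it only reproduces the genotype-based counting described above.

```python
import re


def alt_allele_frequency(genotypes, alt_index=1):
    """Fraction of called alleles equal to alt_index across GT strings."""
    alt_count = 0
    total = 0
    for gt in genotypes:
        # Split on either separator: '|' (phased) or '/' (unphased).
        for allele in re.split(r"[|/]", gt):
            if allele == ".":
                continue  # skip missing calls
            total += 1
            if int(allele) == alt_index:
                alt_count += 1
    if total == 0:
        raise ValueError("no called alleles")
    return alt_count / total


# GT values of NA00001, NA00002 and NA00003 in the example record.
af = alt_allele_frequency(["0|0", "1|0", "1/1"])
print(af)  # 3 ALT alleles out of 6 called alleles
```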

  • namsyvonamsyvo University of MemphisMember

    Thank you @Sheila for your quick answer. I saw this statement in the VCF v4.2 documentation:
    "AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes".
    Could you explain what it means? I'm a little confused about this, given your explanation above.

    One more question: is there any difference between the genotypes 1|0 and 0|1? Sometimes I see 1|0 and sometimes 0|1. For example, for the second sample in my previous example, can I write the genotype as 0|1 instead of 1|0?

    Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Regarding the definition of AF, ours may not entirely match the definition provided by the VCF spec. I think the spec recommends using AF to express allele fraction in the read data, whereas we use it to express the frequency in called genotypes. This may be a violation of the intent of the spec, if you take a strict reading of it.

    In your second question, it comes down to the different ways of writing heterozygous genotypes that have been phased. The order of the 0 and the 1 (which represent specific alleles) signifies how those alleles are phased, either within a pedigree or relative to co-location on physical haplotypes. You can't switch the notation without affecting the meaning that this carries.
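
The point about phased notation can be sketched in a few lines of Python. The helpers `parse_gt` and `same_genotype` are hypothetical names, and the rule of treating allele order as meaningful only when both genotypes use the `|` separator is a simplifying assumption (it ignores, for example, whether two phased calls belong to the same phase set).

```python
def parse_gt(gt):
    """Return (alleles, phased) for a diploid GT string such as '1|0' or '0/1'."""
    phased = "|" in gt
    sep = "|" if phased else "/"
    alleles = tuple(gt.split(sep))
    return alleles, phased


def same_genotype(gt_a, gt_b):
    """Compare two GT strings, respecting allele order only when both are phased."""
    alleles_a, phased_a = parse_gt(gt_a)
    alleles_b, phased_b = parse_gt(gt_b)
    if phased_a and phased_b:
        return alleles_a == alleles_b  # order encodes which haplotype carries the allele
    return sorted(alleles_a) == sorted(alleles_b)  # order carries no meaning


print(same_genotype("1|0", "0|1"))  # phased: swapping the alleles changes the meaning
print(same_genotype("1/0", "0/1"))  # unphased: both describe the same heterozygote
```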

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @namsyvo
    Hi,

    Regarding your phasing question, I found this article to be quite helpful.

    -Sheila
