Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

How should I interpret VCF files produced by the GATK?

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin
edited April 10 in FAQs

This document describes "regular" (variants-only) VCF files. For information on the gVCF format produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document.

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. See this page for detailed specifications.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from NGS data, such as the UnifiedGenotyper and the HaplotypeCaller, is especially complex. This document describes some specific features and annotations used in the VCF files output by the GATK tools.

2. Basic structure of a VCF file

The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from our favorite test sample, NA12878:

##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
chr1    873762  .       T   G   5231.78 PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

It seems a bit complex, but the structure of the file is actually quite simple:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
chr1    873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that here everything is a SNP, but some could be indels or CNVs.

3. How variation is represented

The first 6 columns of the VCF, which represent the observed variation, are easy to understand because they have a single, well-defined meaning.

  • CHROM and POS : The CHROM and POS gives the contig on which the variant occurs. For indels this is actually the base preceding the event, due to how indels are represented in a VCF.

  • ID: The dbSNP rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP.

  • REF and ALT: The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event).

  • QUAL: The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of NGS data is used for variant calling.

  • FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis.

For more details about these fields, please see this page.

In the excerpt shown above, here is how we interpret the line corresponding to each variant:

  • chr1:873762 is a novel T/G polymorphism, found with very high confidence (QUAL = 5231.78)
  • chr1:877664 is a known A/G SNP (named rs3828047), found with very high confidence (QUAL = 3931.66)
  • chr1:899282 is a known C/T SNP (named rs28548431), but has a relative low confidence (QUAL = 71.77)
  • chr1:974165 is a known T/C SNP but we have so little evidence for this variant in our data that although we write out a record for it (for book keeping, really) our statistical evidence is so low that we filter the record out as a bad site, as indicated by the "LowQual" annotation.

4. How genotypes are represented

The genotype fields of the VCF look more complicated but they're actually not that hard to interpret once you understand that they're just sets of tags and values. Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

chr1    873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26

Looking at that last column, here is what the tags mean:

  • GT : The genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate In the three examples above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
  • GQ: The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT. In the diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood that the sample is 0/0, 0/1/, or 1/1 under the model built for the NGS dataset.

  • AD and DP: These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. See the Technical Documentation for details on AD (DepthPerAlleleBySample) and DP (Coverage).

  • PL: This field provides the likelihoods of the given genotypes (here, 0/0, 0/1, and 1/1). These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the heterozygous case, this is L(data given that the true genotype is 0/1). The most likely genotype (given in the GT field) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.

With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282.

chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26

At this site, the called genotype is GT = 0/1, which is C/T. The confidence indicated by GQ = 25.92 isn't so good, largely because there were only a total of 4 reads at this site (DP =4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0). There's a chance that the subject is "hom-var" (=homozygous with the variant allele) since PL(1/1) = 26, which corresponds to 10^(-2.6), or 0.0025, but either way, it's clear that the subject is definitely not "hom-ref" (=homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.

5. Understanding annotations

Finally, variants in a VCF can be annotated with a variety of additional tags, either by the built-in tools or with others that you add yourself. The way they're formatted is similar to what we saw in the Genotype fields, except instead of being in two separate fields (tags and values, respectively) the annotation tags and values are grouped together, so tag-value pairs are written one after another.

chr1    873762  [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473
chr1    877664  [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185
chr1    899282  [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148

Here are some commonly used built-in annotations and what they mean:

Annotation tag in VCF Meaning
AC,AF,AN See the Technical Documentation for Chromosome Counts.
DB If present, then the variant is in dbSNP.
DP See the Technical Documentation for Coverage.
DS Were any of the samples downsampled because of too much coverage?
Dels See the Technical Documentation for SpanningDeletions.
MQ and MQ0 See the Technical Documentation for RMS Mapping Quality and Mapping Quality Zero.
BaseQualityRankSumTest See the Technical Documentation for Base Quality Rank Sum Test.
MappingQualityRankSumTest See the Technical Documentation for Mapping Quality Rank Sum Test.
ReadPosRankSumTest See the Technical Documentation for Read Position Rank Sum Test.
HRun See the Technical Documentation for Homopolymer Run.
HaplotypeScore See the Technical Documentation for Haplotype Score.
QD See the Technical Documentation for Qual By Depth.
VQSLOD Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model.
FS See the Technical Documentation for Fisher Strand
SB How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls).
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • ZaagZaag Posts: 6Member

    I have ABHet and ABHom in my vcf file and I can not find information about what they stand for. Could anyone explain or give some pointers. Thanks.

  • ebanksebanks Posts: 671GSA Member mod

    If you look at the VCF header it gives the definitions there (e.g. ABHet = Allele Balance for hets (ref/(ref+alt))). Unfortunately, we haven't done a great job documenting non-standard annotations.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • RocketknightRocketknight Posts: 1Member

    Hi, I'm a little confused about the QUAL field for multi-sample VCF. When there are multiple samples, how is the single QUAL value calculated? What exactly does that single value represent in this case?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi there,

    In the method article Using the Unified Genotyper, there is a link to a document that explains the math of the EXACT model, i.e. how the QUAL value is calulated.

    Geraldine Van der Auwera, PhD

  • cjandrasitscjandrasits Posts: 1Member

    Hi there,

    I am a bit confused about the genotype column described here and in the linked VCF documentation. The explanation of the GQ is different. Here you say: "The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT."

    And in the VCF version 4 document it is explained like this: "GQ genotype quality, encoded as a phred quality -10log_10p(genotype call is wrong) (Numeric)"

    Which one is right now? From the examples given it seems that the second one is right... Or is there a misinterpretation of your explanation from my side?

    Thanks

  • ebanksebanks Posts: 671GSA Member mod

    These are the same things. The VCF definition is the mathematically correct one, while ours is written in "layman's terms."

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • mhtmht Posts: 8Member

    Hi, I have vcf files, but I'm unsure how to determine the genotype of each SNP and I can't find any guide to how to use GQ and PL to obtain the correct genotypes (whether a SNP is truly heterozygous or homozygous). Can someone point me to some function or method as to how to do this? Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin
    edited November 2012

    All that is answered by point 4 of this document.

    How genotypes are represented

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • mhtmht Posts: 8Member

    Oh, I got confused by the example for NA12878 at chr1:899282. I thought from the explanation we still needed to set GQ and PL thresholds to determine exact genotypes. The reason I asked is because I have some really low GQ (some as low as 0) and was wondering if I should filter, and also what threshold to set it at.

    Thks for your help!

  • MattCowpMattCowp Posts: 1Member

    Can you help me understand the first example in section 4 above?

    0/1:173,141:282:99:255,0,255

    Why is there high-confidence (99) in a 0/1 genotype in the GQ field but low confidence (0) in the PL field for 0/1?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi Matt, you are misunderstanding the PL field. The key sentence is here:

    The most likely genotype (given in the GT field) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.

    Geraldine Van der Auwera, PhD

  • mgymrekmgymrek Posts: 2Member

    Hello,

    In the example above (which is VCF4.0), it is noted that the PL field is only provided when the site is biallelic. In the specification for VCF4.1, the GL and PL fields are defined for multiallelic sites. Can GATK currently handle these?

    If multiallelic sites are supported, is it possible to merge genotypes (using CombineVariants) from different VCF files that have alternate alleles listed in a different order (or even completely different alternate alleles listed). In such cases will the PL field be updated correctly? Or does it require that in each file being merged have exactly the same ALT alleles listed in the same order?

    Thanks, ~M

  • ebanksebanks Posts: 671GSA Member mod

    Thanks, Melissa. I removed the line about the PLs not being produced for sites that aren't bi-allelic (that was from a long, long time ago).

    So, yes, we generally do support multi-allelic sites. But not in CombineVariants; when combining records that do not have the same exact alt alleles, we strip out the PLs. To regenerate them you really need to go back to the BAMs.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • newbie16newbie16 Posts: 15Member

    The QUAL definition here describes it as "probability of Probability of REF/ALT polymorphism". Shouldnt it be just "Probability of REF/ALT polymorphism"? Or am I understanding something wrong?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Yeah, that looks like a copy/paste-induced typo, thanks for pointing it out. Will fix.

    Geraldine Van der Auwera, PhD

  • seb951seb951 Posts: 1Member

    The last part of section 4 is unclear to me because it looks like probabilities are calculated differently for each genotype. Let me explain:

    According to the text, there is a "lack of certainty" in the genotype call because: PL(0/1) = 0, from which it is inferred that there is a "lack of certainty" of this genotype being true. PL(1/1) = 26 = 10^(-2.6) = 0.25%, from which it is inferred that "there's a serious chance that the subject is hom-var" PL(0/0) = 103 = 10^(-10.3) = very small number, from which it is inferred that "the subject is definitely not home-ref".

    But according to me: PL(0/1) = 10^(0) = 100% , from which I infer that the heterozygous reference/alternate genotype IS VERY LIKELY. PL(1/1) = 26 = 10^(-2.6) = 0.25%, from which I infer that it is highly unlikely that it is the homozygous alternate genotype. PL(0/0) = 103 = 10^(-10.3) = very small number, from which I infer that it is highly unlikely that it is the homozygous reference genotype.

    Am I reading this wrong? Can you define genotypes in terms terms of reference / alternate alleles instead of " hom-var", "home-ref" ? and also define a "lack of certainty"?

    Thanks a lot,

    sebastien

    With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282. chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26 At this site, the called genotype is GT = 0/1, which is C/T. The confidence (GQ=25.92) isn't so good, largely because there were only a total of 4 reads at this site (DP=4), 1 of which was ref (=had the reference base) and 3 of which were alt (=had the alternate base) (AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value), whereas there's a serious chance that the subject is hom-var (=homozygous with the variant allele) since PL(1/1) = 26 = 10^(-2.6) = 0.25%. Either way, though, it's clear that the subject is definitely not home-ref (=homozygous with the reference allele) here since PL(0/0) = 103 = 10^(-10.3) which is a very small number.

  • s86001924s86001924 Posts: 1Member

    I thought the DP in INFO field means the total coverage. Why in the given example the AD in FORMAT field doesn't add up to the DP in INFO field? Anyone knows? Thank you!!

    chr1 873762 . T G 5231.78 PASS AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    We've explained this again and again in this forum. Look up the meaning of the DP and AD annotations in the Technical Documentation, and the reason should become clear...

    Geraldine Van der Auwera, PhD

  • patidarrpatidarr Posts: 1Member

    Hi Geraldine,

    I got a vcf file from a collaborator from Broad and I am confused on the sample filed: for all the Homozygous Variants SNPs the fiels is like ==> 1/1:0,1:127:99:3262,381,0 and for all the Homozygous Reference SNPS ==> 0/0:1,0:127:99:0,379,2880

    I am wondering why the allelic count is not equal to the total count ? as in the following case:

    0/1:73,78:151:99:1545,0,1003.

    Any Help is highly appreciated.

    Thanks, Rajesh

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    DP is filtered, AD is unfiltered.

    Geraldine Van der Auwera, PhD

  • xhzhaoxhzhao Posts: 6Member

    Hello,

    What does it mean when the Alt field has multiple alleles separated by comma?
    For example:

    chr1 201333017 . CT CTT,C 7591.42 . AC=1,1;AF=0.50,0.50;AN=2;BaseQRankSum=4.656;DP=1528;FS=1.583;HaplotypeScore=9115.9044;MQ=56.76;MQ0=2;MQRankSum=-0.362;QD=4.97;ReadPosRankSum=-0.709;SB=-2050.23 GT:AD:DP:GQ:PL 1/2:889,0,10:1522:99:7591,3624,7034,3292,0,11050

    Does it mean that at this locus, both alleles are possible? Would this only happen when there are indels?

    thanks, Sharon

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi Sharon,

    This means that the program observed two possible alternate alleles in the data. This can happen for any diploid organism; both chromosomes can have different variants that are themselves different from the "normal" reference.

    Please see the VCF specification for more details on how to interpret VCF files.

    http://www.1000genomes.org/wiki/Analysis/Variant Call Format/vcf-variant-call-format-version-41

    Geraldine Van der Auwera, PhD

  • xhzhaoxhzhao Posts: 6Member

    Thanks a lot Geraldine!

    Best regards, Sharon

  • SharonCoxSharonCox Posts: 4Member

    Dear Geraldine, I'm sorry to ask you a silly question but my vcf files generated by Haplotypecaller and UG don't seem to have called the 0/0 gentotype but only 0/1: and 1/1
    e.g.: 0/119,10:99:191,0,484 e.g.: 1/1:0,10:63:941,63,0

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi Sharon,

    Don't worry, there are no silly questions here. By default the callers only output variant sites, so if you are working on a single sample it is normal to not see any hom-ref (0/0) genotypes. If you were working with multiple samples, then you would be likely to see some hom-ref calls, because when there is at least one of the samples that has a variant allele at a position, the caller will output genotypes for all samples at that position even if some are hom-ref.

    If you want the caller to emit genotypes for all positions, whether or not they are variant, use the EMIT_ALL_SITES genotyping mode option (see documentation for more details). Then you will see the hom-ref calls in the vcf too. But be aware that that will produce a much larger vcf file.

    Geraldine Van der Auwera, PhD

  • mpinellimpinelli Posts: 2Member

    Hi Geraldine, I am callind indels in exome data from an Illumina platform.

    In the VCF file many variants has 3 (GT:DP:GQ) instead of 5 (GT:AD:DP:GQ:PL) fields in the result section.

    The VCF file comes from a multisample calling, and variants from a single sample were selected and filtered (-T SelectVariants --excludeNonVariants -ef). They seem to be heterozygotes, so I would expect to see the number of reads with and without the variation. Can you help me to understand the output?

    Example:

    chr1   44451115    rs61096284  A   AGG 1476.36 PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=0.645;DB;DP=3;FS=5.586;InbreedingCoeff=0.0934;MQ=53.62;MQ0=0;MQRankSum=1.502;QD=7.85;RPA=7,9;RU=G;ReadPosRankSum=-1.554;STR;VQSLOD=1.15;culprit=FS  GT:AD:DP:GQ:PL  0/1:1,1:3:12:22,0,12
    chr1    46726876    rs71862873  GT  GTT 1751.27 PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=1.583;DB;DP=19;FS=0.371;InbreedingCoeff=-0.4295;MQ=57.70;MQ0=0;MQRankSum=-0.141;QD=1.59;RPA=13,14,12;RU=T;ReadPosRankSum=0.898;STR;VQSLOD=2.79;culprit=FS   GT:DP:GQ    0/1:19:60
    chr1    47571943    rs10534549  CTTTT   CTTTTT  10614.73    PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=12.681;DB;DP=17;FS=0.687;InbreedingCoeff=0.4971;MQ=53.20;MQ0=0;MQRankSum=-2.061;QD=2.17;RPA=16,17,12,13,15;RU=T;ReadPosRankSum=0.257;STR;VQSLOD=3.54;culprit=FS GT:DP:GQ    0/1:17:28
    chr1    47691568    rs145295250 GCAGA   G   6016.56 PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=6.816;DB;DP=4;FS=0.000;InbreedingCoeff=0.0205;MQ=58.05;MQ0=0;MQRankSum=1.173;QD=7.06;RPA=3,2;RU=CAGA;ReadPosRankSum=1.269;STR;VQSLOD=3.83;culprit=FS    GT:AD:DP:GQ:PL  0/1:2,2:4:99:118,0,124

    Thanks a lot

    Michele

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi Michele,

    Can you tell me if this file was generated directly by one of the callers (UG or HC)? Or is it derived in any way from other files, for example if it results from combining multiple vcf files from separate calling runs?

    Also, what version of gatk are you using?

    Geraldine Van der Auwera, PhD

  • mpinellimpinelli Posts: 2Member

    Hi,

    the results were generated with the UG following the best practices and GATK version was 2.4. It was a multisample calling and then, as I wrote, I selected indels from each single sample and filtered them. I hope this might help.

    Thank you

    Michele

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    I'm not sure why some of the variants would be missing the AD and PL annotations, especially if they were all called together. We had a bug that caused bad AD annotation values, which was fixed in version 2.6, but never any missing annotations. You could try running VariantAnnotator on a small interval around these variants to add them post-hoc, see if that restores them.

    Geraldine Van der Auwera, PhD

  • volaviivolavii Posts: 17Member

    i observed something stange in the field FORMAT (GT:AD:DP:GQ:PL): here the FORMAT fields of four SNPs:

    0/0:8,0:8:24:0,24,324
    0/1:10,6:16:99:189,0,360
    0/0:3,0:3:9:0,9,122
    1/1:0,2:2:6:67,6,0

    In many cases the GQ of the genotype occures in the PL field... But why? The respective value of the GT in PL is 0. First i thougth that happend just accidently, but i have several of these cases.... :/

  • JackJack Posts: 30Member

    I got a vcf file containing the variant chr1 1007394 . CTG C 144.87 . AC=2;AF=1.00;AN=2;DP=5;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=57.52;MQ0=0;QD=14.49;RPA=3,2;RU=TG;STR GT:AD:DP:GQ:PL 1/1:0,5:5:15:182,15,0 I know the "RPA" means the repeats per allele and the "RU" means repeat unit, does it mean the reference sequence is CTGTGTG and the sample sequence is CTGTG ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    @Jack, your interpretation is correct.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 671GSA Member mod

    @volavii: that's not strange, it's just math (unless you think math is strange!). The likelihoods are normalized so that the most likely is set to P=1.0 (in real space) and the GQ calculation is just the likelihoods of the most likely minus the second most likely genotype. In other words, all is well here.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • volaviivolavii Posts: 17Member

    @ebanks: THANKS!! thank explains "everything" :)

  • kmhernankmhernan Posts: 10Member

    Geraldine I am having an issue with my haploid, multi-sample (n=476) VCF file created from the UnifiedGenotyper. I have some cases where the same chromosome:position is in the file, but have a . for the ALT column (I have also added flanking records if it helps):

    chromosome_1    45627   .   T   .
    chromosome_1    45628   .   A   .
    chromosome_1    45628   .   AG  .
    chromosome_1    45629   .   G   .
    

    The only difference in the metadata columns are seen in FORMAT (same order as above):

    GT:DP:MLPSAC:MLPSAF
    GT:DP:MLPSAC:MLPSAF
    GT:AD:DP:MLPSAC:MLPSAF
    GT:DP:MLPSAC:MLPSAF
    

    I don't understand what case this is indicating? Is this some kind of indel? I did use CatVariants because I split GATK runs by chromosome, but I don't think that would cause any kind of collision here. Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Ah, that's because there is one record line for a SNP at that position, and one for an indel starting at that position. This is perfectly normal.

    Geraldine Van der Auwera, PhD

  • kmhernankmhernan Posts: 10Member

    Ah, I guess I thought there always had to be something in the ALT column, but I see that I am wrong! Thanks.

  • JackJack Posts: 30Member

    I'm a little confused about the variant genotype. How should I interpret this:

    chr1    1887726 rs13825147      GTTTT   G       446.73  PASS    AC=2;AF=1.00;AN=2;DB;DP=11;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=
    50.96;MQ0=0;QD=10.15;RPA=7,3;RU=T;STR   GT:AD:DP:GQ:PL  1/1:0,11:11:33:484,33,0
    Does "AC=2" means that the sample has two alleles at this site and the G is the most probably allele ? And if so,how can AF be 1.00,since AC=2?

    One more question about heterozygous indels, can a heterozygous indel be illustrated below?

    GCTA -------strand1

    GC*A -------strand2

    suppose the reference is GCTA, while strand1 and strand2 are positive strand of two homologous chromosomes? Does my interpretation correct?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Does "AC=2" means that the sample has two alleles at this site and the G is the most probably allele ? And if so,how can AF be 1.00,since AC=2?

    AC=2 means that two alleles were considered for this site (the REF allele + 1 ALT allele) but the sample itself was found to be homozygous-variant (1/1), with AF=1.

    Unfortunately I don't understand your second question, can you please rephrase?

    Geraldine Van der Auwera, PhD

  • pdexheimerpdexheimer Posts: 297Member, GSA Collaborator ✭✭✭

    AC=2 means that two alleles were considered for this site

    I don't think that's quite right, @Geraldine_VdAuwera. AC tells you how many of the chromosomes examined (the number in AN) had an alternate allele. If there were two alternate alleles in the record, AC would have two comma-separated values.

    AF is simply AC/AN (the frequency of the alternate allele in the cohort)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Ack, you're right, @pdexheimer. Thanks for correcting me, I'm not sure what I was thinking of. Let this be a cautionary tale on the perils of dispersing one's attention by responding to forum questions during meetings!

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 30Member

    AC tells you how many of the chromosomes examined (the number in AN) had an alternate allele>

    I'm sorry, I still cannot understand it completely... can you set an example?

    My second question is that I think I have a little difficulty in understanding the concept of homozygous indel and heterozygous indel. When we talk about homozygous and heterozygous genotype, does it always involve two homologous chromosomes? Does a heterzygous indel means one chromosome has the reference allele, while its homologous chromosome has the alternative allele?

    Many thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi Jack,

    I'm sorry but these are very basic questions that are about understanding genetics, not GATK, so I cannot take the time to explain in detail.

    For your first question, please read the VCF format specification, which explains the meaning of all the different fields.

    For your second question, please seek help from a colleague, textbook or one of the many educational resources that you can find online.

    Geraldine Van der Auwera, PhD

  • melilotmelilot Posts: 2Member

    Hi, I have a question on how to interpret the QUAL value. On this page it is stated that it is "The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data." but on the other hand in one of the examples it is said that "a relative low confidence (QUAL = 71.77)" whereas to my understanding, a Phred scaled value of this magnitude is usually a very high value or that in fact, Phred scaled values usually don't even go much above 50 (or there is no point for that). So if this value truly is the Phred scaled probability, and if so, what value can be considered of low or high confidence? Or does this have to be determined for each dataset individually as the value seems to depend on the size of the NGS dataset? Also, could you somehow in laymans terms briefly explain how this value is calculated or what is taken into consideration in the calculation? Is this value useful for filtering or are we better off using the GQ values for example?

    I've also heard that for GQ values, a cutoff of 20 is often used (variants with GQ<20 are of bad quality), do you know what this is based on? Thank you already!

  • melilotmelilot Posts: 2Member

    Addition to my question: We are wondering about the QUAL values in our data since they go up to hundreds of thousands and just didn't seem logical for us to have Phred scaled values this big.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi @melilot,

    You may be thinking of Phred-scaled base qualities, which do indeed tend to be capped at certain values by convention. The various likelihood metrics emitted by GATK, which are also Phred-scaled, can easily go up into the thousands. The important thing to understand is that they are relative, not absolute values. So it does depend very much on the particular dataset you are working with.

    The QUAL value reflects how confident we are that a site displays some kind of variation considering the amount of data available (=depth of coverage at the site) (because we are more confident when we have more observations to rely on), the quality of the mapping of the reads and alignment of the bases (because if we are not sure the bases observed really belong there, they do not contribute much to our confidence), and the quality of the base calls (because if they look like machine errors, they also do not contribute much to our confidence). Filtering on the QUAL value should be done by adjusting the emit and call thresholds at the calling step.

    GQ is a very different metric; it's not about whether the site displays variation. GQ tells you whether, given a site for which there is variation in the population, each sample has been assigned the correct genotype.

    Regarding filtering, have a look at our Best Practices document; the very last section gives some base recommendations for hard-filtering if you cannot perform variant recalibration (VQSR).

    Geraldine Van der Auwera, PhD

  • ykumarykumar Posts: 1Member
    edited September 2013

    Hi I have question I tried VatiantsToTable command to extract various column in Table. I like to extract AD value for my calculation. Unable to extract correct value. It's printing AD column . I used following command

    java -jar ./home/ykumar/app/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -T VariantsToTable  -R ./home/ykumar/ref/GRCh37_order.fa -V ./home/ykumar/Sample11_SNP.vcf -F CHROM -F POS -F ID -F REF -F ALT -F QUAL -F AF -F DP -F QD -F SB -GF GT -GF AD -o ./data/Sample11_AD_SNP.table
    

    its showing like

    CHROM POS          REF ALT    QUAL      AF     DP      QD        SB              GT           AD
    chr1    157101  .       T       C       5002.01 1.00    175     28.58   -6.519e-03      C/C     [[I@29cda59b]
    

    but real value of AD is 2,173

    chr1    157101  .       T       C       5002.01 .       AC=2;AF=1.00;AN=2;BaseQRankSum=-1.134;DP=175;Dels=0.00;FS=0.000;HaplotypeScore=113.2554;MLEAC=2;MLEAF=1.00;MQ=27.39;MQ0=20;MQRankSum=-0.484;QD=28.58;ReadPosRankSum=0.515;SB=-6.519e-03 GT:AD:DP:GQ:PL  1/1:2,173:175:99:5002,393,0
    

    I like to know can we extract AD value as separate column in GATK or not??

    Thank you already!!!

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi @ykumar,

    This looks like it might be a bug in how the value is extracted and represented. Can you confirm that you are using the latest version of GATK? If not, please run this again with the latest version and let me know if the error is still there.

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 36Member

    Hi, How does one interpret ./. instead of being call information? Is it no call?

    Having difficulty finding that answer. Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Yes, that means no-call. We don't document it explicitly because it's not specific to gatk, it's in the vcf specification.

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 36Member

    I see thank you. Is there a "most common" reason for a no call? Apologies that is slightly off topic.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    The only reason gatk issues a no-call is when there is no useable coverage at a site.

    Geraldine Van der Auwera, PhD

  • nreidnreid UC DavisPosts: 2Member

    Hi,

    This has been brought up before, but I find it very confusing. Above, the Genotype Quality is listed as being the:

    ...Phred-scaled confidence that the true genotype is the one provided in GT. In the diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood that the sample is 0/0, 0/1/, or 1/1 under the model built for the NGS dataset.

    After that, @ebanks said:

    The likelihoods are normalized so that the most likely is set to P=1.0 (in real space) and the GQ calculation is just the likelihoods of the most likely minus the second most likely genotype

    These statements are not equivalent using any algebraic contortions I can conjure. I believe something similar to what @ebanks said must be true, that GQ is the likelihood ratio of the second most likely genotype to GT. This is exactly consistent with the observation that the second most likely genotype's phred-scaled likelihood appears as GQ very often. This is not always the case, however, as it seems that when the second most likely genotype has a phred scaled likelihood of greater than 100, GQ is set to 99.

    Furthermore, though I am not a statistical expert, I work with likelihoods in many contexts and I have never heard anyone refer to a likelihood ratio as giving either an error probabililty or the probability that a hypothesis is true as @ebanks did above.

    Can anyone shed light on just what the GQ is, and how it is arrived at? Thanks.

  • ebanksebanks Posts: 671GSA Member mod

    Hi @nreid, 1. Since I personally implemented it I can tell you for sure that the GQ coming out of the GATK is the likelihoods of the most likely genotype minus the likelihoods of the second most likely genotype. The standard has been to cap this value at 99 (and in fact earlier versions of the VCF specification actually put this cap in as a requirement) so that's why you don't see values larger than that.

    1. The official VCF specification defines GQ as the "conditional genotype quality, encoded as a phred quality -10log_10p(genotype call is wrong, conditioned on the site's being variant)". That's what is meant by error (i.e. that the genotype call is wrong).

    Hope this helps.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • miaoxumiaoxu Posts: 1Member
    edited October 2013

    Hi Geraldine, thank you and your collegue so much for developing GATK and sharing information on this platform. You really helps a lot. I'm a newcomer to process sequencing data with GATK. I'm confused with hard filtering variant with INFO field information considering the ultra-high coverage (above 1000X) since I noticed some INFO fields, e.g. FS, change with depth and ploidy. The parameters of hard-filter "best pracitce" suggested may not suitable for filtering variant with ultra-high coverage. I set the -dcov parameter of UG to 1000 and -ploidy to 10 when calling variant. Could you pls help me with the three questions below:

    1. how would QD, MQ, FS, HaplotypeScore, MQRankSum, ReadPosRankSum change with depth?
    2. does ploidy have any affects on these INFO fields?
    3. what threshold of these fields do you suggest I should test with my data? I used the parameters listed below

    For SNP:

    --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "HARD_TO_VALIDATE" --filterExpression "QUAL < 50.0" --filterName "VeryLowQual" --filterExpression "QD < 2" --filterName "LowQD" --filterExpression "MQ < 40.0" --filterName "LowMQ" --filterExpression "FS > 200.0" --filterName "StrandBias"

    For INDEL

    --filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" --filterName "HARD_TO_VALIDATE" --filterExpression "QUAL < 30.0" --filterName "VeryLowQual" --filterExpression "QD < 2" --filterName "LowQD" --filterExpression "FS > 200.0" --filterName "StrandBias".

    Although I have read the detailed illustrations many times I still cannot completely understand some of the INFO fields. Pls excuse me if these are simple and naive questions.

    Thank you so much.

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi @miaoxu,

    This is a complicated topic and there are no easy answers, unfortunately. Especially if you are dealing with levels of ploidy that are very different than what we have experience with. You will need to do a lot of experimenting to find the right settings for your data.

    1. The answer to your first question depends a lot on what kind of depth we are talking about. There is a good kind of depth (good quality sequence and/or appropriately mapped) which will increase your confidence in results. Then there is a bad kind of depth (poor quality sequence and/or poorly mapped) which is not useful and will actually dilute your confidence. The annotations will change accordingly; think about it and it should make logical sense. If not, try to read more about what these annotations represent.

    2. Many annotations have restrictions in how applicable they are. Some annotations require the presence of heterozygous data, and some can only be calculated for the diploid case. These restrictions are listed in the technical documentation, which you can find here: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_HaplotypeScore.html

    3. We provide some recommendations here, but they are all based on the assumption of diploidy, so you will need to experiment with them to find what is appropriate for your data.

    Good luck!

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 30Member

    Hi there. I used UG to call SNPs, and then I used SNP chip to validate the callset. It turns out that a lot of my SNPs have different genotypes from chip results. One of the results is as follows: chr21 1621024 rs15181140 G C 253.96 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=1.423;DB;DP=11;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=57.87;MQ0=0;MQRankSum=0.474;QD=23.09;ReadPosRankSum=0.474 GT:AD:DP:GQ:PL 1/1:1,10:11:5:281,5,0 which gives an genotype of CC, while the result from chip result is CG. And there are many similar results, I wonder how should I modify the command argument to make the results concordant to chip results?

    From the PL field, there is a quite high probability that the genotype is CG(0/1 ),isn't it ?

  • ebanksebanks Posts: 671GSA Member mod

    Exactly. The caller has extremely high confidence in the site being non-reference (hence the poor 281 for the GG genotype) but no confidence to distinguish whether the correct genotype is GC or CC; note that the GQ (Genotype Quality or confidence) is Q5 which is extremely small. If you want concordance to chip results you will want to filter out genotypes with GQ < 30. Note that we are actively working on a post-processing step to improve genotypes based on population frequencies, but this won't be ready for a little while.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • chenyu600chenyu600 Posts: 7Member
    edited November 2013

    Hi@Geraldine_VdAuwera

    Seems the different GATK version with different INFO fields, how to set the INFO field with GATK?

    Post edited by chenyu600 on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi @chenyu600,

    I'm sorry but I don't understand your question. Can you please rephrase?

    Geraldine Van der Auwera, PhD

  • chenyu600chenyu600 Posts: 7Member
    edited November 2013

    Hi @Geraldine_VdAuweraoh, I mean I want to know how to define the GATK output vcf file's INFO fields

    Post edited by chenyu600 on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi @chenyu600,

    If you mean you want to use specific INFO annotations that are not applied by default, the way to do it is to specify -an <desired annotation> in your command line at the calling step, or you can add them afterwards using VariantAnnotator. Have a look at the tool documentation for usage details.

    Geraldine Van der Auwera, PhD

  • nikhil_joshinikhil_joshi Posts: 12Member

    Hello,

    I was looking over this documentation and I have a question about one of the examples above. The VCF line is this:

    chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0

    but the interpretation is this:

    chr1:877664 is a known A/G SNP (named rs3828047), found with very high confidence (QUAL = 3931.66)

    Is this correct? It seems from the VCF line that the SNP call is G/G, not A/G. Am I missing something?

    • Nik.
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi Nik, sorry for the delay.

    The SNP itself is A -> G, and the sample shown here is homozygous variant (G/G) for that SNP. Make sense?

    Geraldine Van der Auwera, PhD

  • cyriaccyriac Posts: 1Member

    In the VCF header, AD is described as ##FORMAT=<ID=AD,Number=.,Type=Integer,... where Number=. indicates that the number of possible values varies, is unknown, or is unbounded. But isn't the number of values equal to the number of genotypes? And per VCF specs, the header should say Number=G instead.

  • pdexheimerpdexheimer Posts: 297Member, GSA Collaborator ✭✭✭

    No, the number of values is equal to the total number of alleles, including reference. It should probably be a Number=R, but that's new to VCF v4.2, and I think the GATK output files are still v4.1. In 4.1, the only possibilities were A (number of alternate alleles) and G. Since it's neither of those, and isn't a fixed number, it had to be 'unknown'

  • flowflow Posts: 14Member

    Hello,

    I have incorporated SnpEff predictions into my VCF using VariantAnnotator. I have the following indel record:

    Chr1 89203 . ATCT A,ATCTTCT 10285.80 . AC=18,15;AF=0.155,0.129;AN=116;BaseQRankSum=0.163;DP=1128;FS=3.274;InbreedingCoeff=0.9781;MLEAC=17,15;MLEAF=0.147,0.129;MQ=58.32;MQ0=0;MQRankSum=-4.942;QD=13.50;RPA=5,4,6;RU=TCT;ReadPosRankSum=-0.067;SNPEFF_EFFECT=FRAME_SHIFT;SNPEFF_EXON_ID=LOC_Os01g01180.1:exon_1;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=mRNA;SNPEFF_GENE_NAME=LOC_Os01g01180;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=LOC_Os01g01180.1;STR GT:AD:DP:GQ:PL 0/0:15,0,0:18:45:0,45,815,45,807,801 ... 2/2:0,0,6:7:18:319,337,504,18,18,0

    As you can see the RPA and RU tags are present indicating a repeat and that SnpEff is predicting a frame shift. This is a single exon gene on the minus strand. The CDS record in the gff is:

    Chr1 MSU_osa1r7 CDS 88986 89204 . - . ID=LOC_Os01g01180.1:cds_1;Parent=LOC_Os01g01180.1

    So, looking at the GFF, the AT at positions 89203-89204 of the reference allele are the first two bases of an ATG codon on the minus strand. Since all three alleles have alleles have an AT (1: A TCT TCT TCT TCT TCT, 2: A TCT TCT TCT TCT and 3: A TCT TCT TCT TCT TCT TCT) I'm not sure why SnpEff is predicting a frameshift. Is this a bug (where, perhaps, the RPA tag is being ignored by snpEff) or is there another explanation?

    Jonathan

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi @flow, that is a question for the authors of SnpEff, since it's not our tool.

    Geraldine Van der Auwera, PhD

  • flowflow Posts: 14Member

    Hi Geraldine,

    True, I understand that, but wondered if other users might be aware of the issue. I know that the GATK team at one point extensively evaluated snpEff and wondered if this was discovered and if GATK developers were aware of it.

    Jonathan

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Fair enough. No, we're not aware of any particular issue on this point. To be honest, to say we "extensively evaluated" SnpEff would be an exaggeration. We looked into several programs and found that SnpEff gave useful results in a format that we could easily work with. We haven't done real in-depth testing, so we can't guarantee it is bug-free. In our experience, no program ever is...

    Geraldine Van der Auwera, PhD

  • flowflow Posts: 14Member

    Thanks Geraldine, I will contact the snpEff developer.

  • flowflow Posts: 14Member

    I clarified with the snpEff developer that snpEff does not presently consider the RPA/RU tags. I suspect this will be problematic for users who incorporate snpEff predictions into Unified Genotyper indel calls that use these tags.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Ah, good to know -- thanks for reporting back with this info. You may want to consider an alternate program then. One of the reasons we chose SnpEff at the time was that few annotation programs produced VCF files, but now that is more common, so others may be just as easy to work with.

    Geraldine Van der Auwera, PhD

  • samshensamshen tampaPosts: 1Member

    Because the Phred scale is -10 * log(1-p),? What is the p? Should it be Phred scale is -10 * log(p),? Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    @samshen, I'm sorry but I don't understand your question. Can you please clarify? What do you want to know?

    Geraldine Van der Auwera, PhD

  • Greg_OwensGreg_Owens Posts: 1Member

    A small fraction of my genotypes are strange. Here is an example

    ./. ./. 0/0:17,0:18:45:0,45,353 ./.:.:1 0/0:25,0:25:75:0,75,592 ./.

    I've bolded the genotype that I don't understand what it means. It would suggest that there are no reads for that sample, but I don't know what the one at the end means. At other sites the number is different. This is from a raw vcf file output by GATK.

    Thanks for your help!

  • pdexheimerpdexheimer Posts: 297Member, GSA Collaborator ✭✭✭

    @Greg_Owens‌ - This is how a VCF handles fields with no data. It's confusing, and we've actually addressed it a few times on the forum (here and here, for instance). In the particular case you posted, there's a single low quality read at that site, so nothing is reported in the AD annotation and no genotype is called

Sign In or Register to comment.