GATK licensing moves to direct-through-Broad model -- read about it on the GATK blog

How should I interpret VCF files produced by the GATK?

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin
edited May 13 in FAQs

This document describes "regular" (variants-only) VCF files. For information on the gVCF format produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document.

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from NGS data, such as the UnifiedGenotyper and the HaplotypeCaller, is especially complex. This document describes some specific features and annotations used in the VCF files output by the GATK tools.

2. Basic structure of a VCF file

The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from our favorite test sample, NA12878:

##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
chr1    873762  .       T   G   5231.78 PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

It seems a bit complex, but the structure of the file is actually quite simple:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
chr1    873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that here everything is a SNP, but some could be indels or CNVs.

3. How variation is represented

The first 6 columns of the VCF, which represent the observed variation, are easy to understand because they have a single, well-defined meaning.

  • CHROM and POS : The CHROM and POS gives the contig on which the variant occurs. For indels this is actually the base preceding the event, due to how indels are represented in a VCF.

  • ID: The dbSNP rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP.

  • REF and ALT: The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event).

  • QUAL: The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of NGS data is used for variant calling.

  • FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis.

For more details about these fields, please see this page.

In the excerpt shown above, here is how we interpret the line corresponding to each variant:

  • chr1:873762 is a novel T/G polymorphism, found with very high confidence (QUAL = 5231.78)
  • chr1:877664 is a known A/G SNP (named rs3828047), found with very high confidence (QUAL = 3931.66)
  • chr1:899282 is a known C/T SNP (named rs28548431), but has a relative low confidence (QUAL = 71.77)
  • chr1:974165 is a known T/C SNP but we have so little evidence for this variant in our data that although we write out a record for it (for book keeping, really) our statistical evidence is so low that we filter the record out as a bad site, as indicated by the "LowQual" annotation.

4. How genotypes are represented

The genotype fields of the VCF look more complicated but they're actually not that hard to interpret once you understand that they're just sets of tags and values. Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

chr1    873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26

Looking at that last column, here is what the tags mean:

  • GT : The genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate
      In the three examples above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
  • GQ: The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT. In the diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood that the sample is 0/0, 0/1/, or 1/1 under the model built for the NGS dataset. The GQ is simply the second most likely PL - the most likely PL. Because the most likely PL is always 0, GQ = second highest PL - 0. If the second most likely PL is greater than 99, we still assign a GQ of 99, so the highest value of GQ is 99.

  • AD and DP: These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. See the Technical Documentation for details on AD (DepthPerAlleleBySample) and DP (Coverage).

  • PL: This field provides the likelihoods of the given genotypes (here, 0/0, 0/1, and 1/1). These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the heterozygous case, this is L(data given that the true genotype is 0/1). The most likely genotype (given in the GT field) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.

With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282.

chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26

At this site, the called genotype is GT = 0/1, which is C/T. The confidence indicated by GQ = 25.92 isn't so good, largely because there were only a total of 4 reads at this site (DP =4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0). There's a chance that the subject is "hom-var" (=homozygous with the variant allele) since PL(1/1) = 26, which corresponds to 10^(-2.6), or 0.0025, but either way, it's clear that the subject is definitely not "hom-ref" (=homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.

5. Understanding annotations

Finally, variants in a VCF can be annotated with a variety of additional tags, either by the built-in tools or with others that you add yourself. The way they're formatted is similar to what we saw in the Genotype fields, except instead of being in two separate fields (tags and values, respectively) the annotation tags and values are grouped together, so tag-value pairs are written one after another.

chr1    873762  [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473
chr1    877664  [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185
chr1    899282  [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148

Here are some commonly used built-in annotations and what they mean:

Annotation tag in VCF Meaning
AC,AF,AN See the Technical Documentation for Chromosome Counts.
DB If present, then the variant is in dbSNP.
DP See the Technical Documentation for Coverage.
DS Were any of the samples downsampled because of too much coverage?
Dels See the Technical Documentation for SpanningDeletions.
MQ and MQ0 See the Technical Documentation for RMS Mapping Quality and Mapping Quality Zero.
BaseQualityRankSumTest See the Technical Documentation for Base Quality Rank Sum Test.
MappingQualityRankSumTest See the Technical Documentation for Mapping Quality Rank Sum Test.
ReadPosRankSumTest See the Technical Documentation for Read Position Rank Sum Test.
HRun See the Technical Documentation for Homopolymer Run.
HaplotypeScore See the Technical Documentation for Haplotype Score.
QD See the Technical Documentation for Qual By Depth.
VQSLOD Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model.
FS See the Technical Documentation for Fisher Strand
SB How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls).
Post edited by Sheila on

Geraldine Van der Auwera, PhD

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Geraldine Van der Auwera, PhD

  • HeidihuangHeidihuang ChinaPosts: 18Member

    Hello,
    Is there any information about non-reference counts? In my vcf file, the genotype field has 9 subfields including
    GT:AB AD:DOS:DP:GQ:NALT:NALTT:PL. I am confused by NALT and NALTT. Is NALT will equal to the total numbers of alt alleles in AD field? Or are there any other ideas to get the non-reference counts for every site of every sample?
    Thanks a lot!

    The description of FORMAT field in my vcf file is as follows.

    GT:AB:AD:DOS:DP:GQ:NALT:NALTT:PL

    FORMAT=<ID=AB,Number=1,Type=Float,Description="Allele balance for each het genotype">

    FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

    FORMAT=<ID=DOS,Number=1,Type=Float,Description="Dosage from likelihoods">

    FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">

    FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

    FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

    FORMAT=<ID=NALT,Number=1,Type=Integer,Description="Number of nonreference alleles for this individual">

    FORMAT=<ID=NALTT,Number=1,Type=Integer,Description="Number of nonreference alleles for this individual; thresholded">

    FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

  • HeidihuangHeidihuang ChinaPosts: 18Member

    @Geraldine_VdAuwera
    Sorry for asking silly questions. In FORMAT field, AD corresponds to non-filtered counts, and DP is filtered. I was wondering which one is corresponding non-reference counts for DP in FORMAT field? If I want one corresponding non-reference counts and depth, no matter it is filtered or not, how can I extract them? Thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    @Heidihuang‌ If you want to get the count of all reads that do not have the reference allele, you should take the sum of the AD elements minus the first AD element (which is the reference depth).

    Geraldine Van der Auwera, PhD

  • HasaniHasani GermanyPosts: 22Member
    edited August 2014

    Over 70% of the links in this page are broken..could you please re-point to the right page?

    Broken links are:

    • AD (DepthPerAlleleBySample) and DP (Coverage).

    • AC,AF,AN -> Chromosome Counts.

    • DP -> Coverage.

    • Dels -> SpanningDeletions.

    • MQ and MQ0 -> RMS Mapping Quality and Mapping Quality Zero.

    • BaseQualityRankSumTest -> Base Quality Rank Sum Test.

    • MappingQualityRankSumTest -> Mapping Quality Rank Sum Test.

    • ReadPosRankSumTest -> Read Position Rank Sum Test.

    • HRun -> Homopolymer Run.

    • HaplotypeScore -> Haplotype Score.

    • QD -> Qual By Depth.

    • FS -> Fisher Strand

    Thank you in advance!

    Post edited by Hasani on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Sure, will do. In the meantime you can find those doc pages by going to Guide > Tool Documentation > Annotations

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 36Member

    Hi Geraldine! Recently I used GATK HaplotypeCaller to call SNPs, but I didn't quite understand the output. I used HaplotypeCaller to generate g.vcf files and then genotype these g.vcf files together. Below is a few examples:

    The first is :
    chr1 231884 rs316190603 C T 513.90 PASS AC=2;AF=1.00;AN=2;DB;DP=37;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.62 GT:AD:DP:GQ:PL 1/1:0,12:12:36:540,36,0 ./.:.:5 ./.:.:20
    In this example, I wonder why GATK gives ./.:.:5 and ./.:.:20, assigning no genotypes and allele depth (AD) for these two samples.I examined the read mapping, and found that both the second and the third sample have reads mapped at this site (see the attatchment).

    The second example is :
    chr3 98144609 . A G 318.39 PASS AC=4;AF=0.667;AN=6;DP=0;FS=0.000;MLEAC=4;MLEAF=0.667;MQ=0.00;MQ0=0 GT:GQ:PL 1/1:3:133,3,0 0/1:23:38,0,23 0/1:3:175,0,3
    In this example, why the filed "DP" is 0, and GATK still assign genotypes for the samples?

    Thanks!

    read mapping.png
    1366 x 633 - 16K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Hi @Jack,

    For your first example, can you post the lines from the individual sample GVCFs? you should use the -bamout argument to see what the region looks like after reassembly. It's possible that the reassembly process has moved the reads somewhere else and that the site is no longer covered.

    For your second example, the DP annotation is the filtered depth, so it's possible that there are reads there, but they fail the HC's quality filters. When that happens, the HC will still try to genotype the samples, but you can see from the PL values that there is no real confidence about which genotype is correct.

    Geraldine Van der Auwera, PhD

  • naymikmnaymikm Posts: 6Member

    Hi I am trying to make sense of this low quality snp:
    chr1 1671866 rs7364986 A T,C 0.01 LowQual DB;DP=2;MQ0=0;NS=1 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS 1/1:28.5,NaN:0,2,0,0,0,0:.:.:.
    There are two ALT alleles but here the 1/1 indicates that only the T is being referred to, from my understanding. My question is why is the C ALT allele even listed? Is this just because it is such low quality the software is just noting that a C is just as likely here?

    Thanks,
    Marcus

  • SheilaSheila Broad InstitutePosts: 1,241Member, GATK Dev, Broadie, Moderator, DSDE Member admin

    @naymikm

    Hi Marcus,

    Can you tell me which tool you are using and which version of GATK you are using? Also, please post your exact command line.

    Thanks,
    Sheila

  • thalekothaleko OsloPosts: 1Member
    edited November 2014

    Hi,

    Hope this isn't too stupid a question - I am new to variant calling and have now gone through your RNA-seq VC workflow using my paired end illumina reads and b37 fasta (from your ftp server) as reference. The reads were mapped with the STAR 2-pass approach. I have now reached the variant calling/filtering step. My problem is: The VCF files that are produced only seem to give me variants from chromosome 20. (ie all entries have 20 in the #CHROM field).

    Is there a problem with the reference used? Or am I just interpreting the vcf file all wrong? (If so, where can I find the variants from the remaining 22 chromosomes....?)

    (Edit: I think I figured it out, the culprit is probably the -L 20 argument in the BQSR step... Sorry!)

    Post edited by thaleko on
  • SheilaSheila Broad InstitutePosts: 1,241Member, GATK Dev, Broadie, Moderator, DSDE Member admin
    edited November 2014

    @thaleko

    Hi,

    Can you please post the exact commands you used? Did you use -L in your commands?

    If you used the -L because that is what we have in out tutorials, please note we only include the -L 20 in order to restrict analysis to a chromosome interval for demo purposes. For more details on using -L argument, you can refer here: https://www.broadinstitute.org/gatk/guide/article?id=4133

    -Sheila

    Post edited by Sheila on
  • jullflyjullfly CanadaPosts: 6Member

    Hi,
    I have not been able to find this information anywhere on this website or any of the websites for VCF file documentation, but I apologize if this information is given somewhere. I am wondering what order the ALT alleles (from VCF files made from HaplotypeCaller) are listed in: is ALT1 always the most frequent ALT allele and the other ALT alleles are listed in order of decreasing frequency?
    Clarification would be greatly appreciated.
    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    @jullfly It's a good question, I don't think this is discussed explicitly anywhere. The VCF spec does not impose any logic to the ordering of alternate alleles. I can't speak for other programs, but GATK itself does not incorporate any informative logic to the allele ordering -- I forget if it is alphabetical or some random process, but it does not imply anything meaningful about allele frequency.

    Geraldine Van der Auwera, PhD

  • jullflyjullfly CanadaPosts: 6Member

    @Geraldine_VdAuwera Thanks a lot for your prompt reply! It would be easier if ALT alleles were listed in order of decreasing frequency for me to do some hard filtering with JEXL expressions, but I should still be able to work around it.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    I see what you mean, but from a technical standpoint I can imagine a lot of things going wrong with that, so I don't think it's something we would want to build in. Maybe a separate annotation would do the trick. Now we just need someone to volunteer to write one ;)

    Geraldine Van der Auwera, PhD

  • geoffroyGATKgeoffroyGATK INSERMPosts: 2Member

    Hi,
    This is just for information. In the "What is VCF?" section, the link for VCF specification should be updated.
    The VCF specification is no longer maintained by the 1000 Genomes Project. The group leading the management and expansion of the format is the Global Alliance for Genomics and Health Data Working group file format team, http://ga4gh.org/#/fileformats-team

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Fair point (we are well acquainted with GA4GH); will update those docs accordingly.

    Geraldine Van der Auwera, PhD

  • slowsmileslowsmile New yorkPosts: 2Member

    This page is very helpful but I have a followup question. What is the order in AD and PL if the variant is multi-allelic? in my case, I have
    ref=G
    alt=GA,GAA,GAAA,A

    in one of the samples I have
    GT=1/2,
    AD=7 0 6 27 19 0 0 0 0 0
    PL= 1111 216 215 351 0 516 704 248 419 709 704 248 419 709 709

    What is the order of AD or PL values corresponding to each of the allele case?

  • pdexheimerpdexheimer Posts: 463Member, GATK Dev, DSDE Member mod

    @slowsmile - AD is pretty straightforward, I think it even says in the header. It's ref allele first, followed by the alt alleles in the same order as the ALT column. This doesn't match what you have in your example, though - you only have five alleles, but you show 10 values for AD. Not sure what's going on there.

    PL is far more complex. From the VCF spec: "If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."

    So in your example, the ordering will be 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, 0/3, 1/3, 2/3, 3/3, 0/4, 1/4, 2/4, 3/4, 4/4

  • SheilaSheila Broad InstitutePosts: 1,241Member, GATK Dev, Broadie, Moderator, DSDE Member admin

    @slowsmile
    Hi,

    Sorry, I just deleted my previous post. I made a mistake, but @pdexheimer answered you correctly above.

    -Sheila

  • slowsmileslowsmile New yorkPosts: 2Member

    @Sheila and @pdexheimer , Thanks both of you. Your answer is very prompt and accurate. It helps me a lot. Best

  • yoyorollsyoyorolls VancouverPosts: 5Member

    Hi,

    I'm new to GATK and I just tried out your best practice for getting variants from RNA-seq data for one human sample. Maybe I'm not understanding the vcf format correctly, but all my variants have either AC= 2, or 1; AF= 1, or 0.5, and all AN = 2. But the sample DP clearly has variants with DP >2. I don't understand why AC and AF info doesn't match up with sample DP. I thought Allelic Count + ref count = sample DP. Also, I don't believe all the varants have allelic frequency of either 1 or 0.5, seeing that it's RNA-seq.

    I followed through the protocol, and did the base recalibration as well. The only difference I can think of is that for base recalibration, I used bed file for snp instead of vcf, and because the program didn't like the way bed formats its position (eg. insertions would have the same start and end sites, entries with start site lower than the end site of previous entry etc), I changed it so that the end site is always start +1 bp. I also didn't do the optional indel realignment. I did notice, that after marking duplicates, it said >80% of reads were filtered... Could that be the problem?

    (RNA-seq was done with ion torrent, and ~30M reads. When mapping using STAR, I used snp-masked genome. In GATK, I used regular unmasked genome)
    Thanks!

  • pdexheimerpdexheimer Posts: 463Member, GATK Dev, DSDE Member mod

    @yoyorolls -

    "Allele Count" (AC) refers to the number of called alleles, not to the number of reads supporting an allele. With a single diploid sample, you're only calling two alleles, and either one or both of them will be variant.

    The number of reads supporting each allele are output in the FORMAT-level AD annotation

  • yoyorollsyoyorolls VancouverPosts: 5Member

    Ok, thanks!

    Also, from previous discussions, it seems that AD is unfiltered while sample-DP is filtered. Is there a way to get the filtered number of reads supporting each allele? Or is it common practice to use the AD numbers to calculate allele frequency?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    We don't currently have official recommendations for calculating allele frequency in RNAseq, but basically, in the current state you would indeed use AD.

    Geraldine Van der Auwera, PhD

  • yoyorollsyoyorolls VancouverPosts: 5Member

    Alright~ thanks!

  • solssols saols, manipalPosts: 3Member

    I have called variants for our RNA-Seq data following the best practice as mentioned. I am unable to interpret some of the entries in the vcf files. I am pasting the examples below.

    chr1 1288459 . TG T 283.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.064;ClippingRankSum=-0.268;DP=50;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=35.71;MQ0=0;MQRankSum=0.268;QD=5.67;ReadPosRankSum=-0.755 GT:AD:DP:GQ:PL 0/1:25,23:48:99:321,0,344

    Here, it states that the variant is heterozygous with the DP 48 and ref count AD 25 and alt count AD 23. Quality looks good and the confidence GQ is also very high.

    Why are two nucleotides mentioned in REF column? How is it heterozygous if its T in REF and T in ALT columns?

    I also find the similar entries with more than two nucleotides in the REF column.

    chr1 160209736 . AAGG A 591.52 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.804;ClippingRankSum=-0.291;DP=44;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=36.53;MQ0=0;MQRankSum=-0.222;QD=13.44;ReadPosRankSum=3.505 GT:AD:DP:GQ:PL 0/1:6,38:44:7:628,0,7

    Lastly, the same behavior is also observed the other way around.

    chr1 1153898 . C CT 66.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.787;ClippingRankSum=-0.208;DP=77;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=45.50;MQ0=0;MQRankSum=1.633;QD=0.87;ReadPosRankSum=-0.659 GT:AD:DP:GQ:PL 0/1:50,23:73:99:104,0,547

    Can you please help me interpret it.

    Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Hi @sols, those are simply insertions and deletions (indels). The VCF format specification has some helpful explanations for interpreting these records.

    Geraldine Van der Auwera, PhD

  • SheilaSheila Broad InstitutePosts: 1,241Member, GATK Dev, Broadie, Moderator, DSDE Member admin

    @sols
    Hi,

    The first two examples you posted above are of deletions at the site. The third example you posted is of an insertion at the site.

    In your first example, the reference has TG at position 1288459, but the sample has an alternate allele of T, meaning the G has been deleted.

    In the last example, the reference has a C, but the sample has an alternate allele of CT, meaning there is a T insertion.

    I hope this helps.

    -Sheila

  • solssols saols, manipalPosts: 3Member

    Thank you. I got it

  • MaguelonneMaguelonne ParisPosts: 7Member

    Hi,

    To have full information, I added the "StrandBiasBySample" option for the calling. Since the "StrandBiasBySample" documentation specifies that it gives the "numbers of forward and reverse reads that support REF and ALT alleles", I'm wondering how to deal with a multi-allelic variant. To be sure, if a sample's genotype is 1/2, do the "DP1,DP2,DP3,DP4" given by this option correspond to the numbers of forward and reverse reads that support alternative alleles "1" and "2"?

    Thanks

  • SheilaSheila Broad InstitutePosts: 1,241Member, GATK Dev, Broadie, Moderator, DSDE Member admin

    @Maguelonne
    Hi,

    Unfortunately, StrandBiasBySample only gives the ref-forward, ref-reverse, first alt-forward, first alt-reverse. So, it does not give the two alt alleles.

    The better annotation to use is StrandAlleleCountsBySample. It gives you the forward and reverse counts of all alleles (ref and all alternate alleles) in order.

    -Sheila

  • SV_487SV_487 Washington, D.C.Posts: 1Member

    Hi,

    VCF can be especially unwieldy and inefficient when dealing, for example, with rare variants, or when conducting a variation discovery (i.e. non-genotyping) study. Consider a study with thousands of samples, in which most variants are either rare (seen in only one or two samples) or are defined by just one observation because they are being discovered for the first time. VCF files for these data consist mostly of thousands upon thousands of null genotypes, drastically expanding storage requirements while providing no added informational benefit.

    Instead of using the genotype column format / approach, it would seem to make sense in this case to use INFO tags to list the one (or perhaps two) samples in which a variant was seen, and dispense with the genotype columns. But I don't see provisions for this in the VCF spec.

    What approach would you use if you had to deal primarily with this type of data in VCF? Is there an alternative format or tool of which I am not aware?

    Thank you

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    @SV_487 Currently we're just grinning and bearing it, e.g. for the ExAC dataset the VCF file is just ridiculous, but we don't have a better solution yet. Working on developing something better but it will be a while before that's ready to even discuss (it's exploratory work).

    Geraldine Van der Auwera, PhD

  • freyesfreyes MexicoPosts: 1Member

    Hi, regards from Mexico,

    About VCF text in '2. Basic structure of a VCF file' you explain that first two variants show 'very high confidence' and third variant show 'relative low confidence'. Please tell me are there specific numeric values that constitute the boundaries between 'low confidence' 'relative low confidence', 'confidence', 'very high confidence', etc.?, how could I define these boundaries as objectively as possible? or do these labels are totally subjective?

    Thank you very much in advance for your kind answer.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    @freyes These are subjective judgments based on experience. It's very difficult to define objective boundaries because it depends a lot on the dataset and experiment, and some semi-random technical aspects. That's why we use some filtering tools (see VQSR, variant recalibration in the Best Practices documentation) that use machine learning for filtering variants in a way that balances the tradeoff between sensitivity and specificity.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.