It looks like you're new here. If you want to get involved, click one of these buttons!
Geraldine_VdAuwera
Posts: 2,239Administrator, GSA Official Member admin
VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. See this page for detailed specifications.
VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.
That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from NGS data, such as the UnifiedGenotyper and the HaplotypeCaller, is especially complex. This document describes some specific features and annotations used in the VCF files output by the GATK tools.
The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from our favorite test sample, NA12878:
##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
It seems a bit complex, but the structure of the file is actually quite simple:
[HEADER LINES]
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that here everything is a SNP, but some could be indels or CNVs.
The first 6 columns of the VCF, which represent the observed variation, are easy to understand because they have a single, well-defined meaning.
CHROM and POS : The CHROM and POS gives the contig on which the variant occurs. For indels this is actually the base preceding the event, due to how indels are represented in a VCF.
ID: The dbSNP rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP.
REF and ALT: The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event).
QUAL: The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of NGS data is used for variant calling.
FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis.
For more details about these fields, please see this page.
In the excerpt shown above, here is how we interpret the line corresponding to each variant:
The genotype fields of the VCF look more complicated but they're actually not that hard to interpret once you understand that they're just sets of tags and values. Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:
chr1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
Looking at that last column, here is what the tags mean:
GT : The genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:
GQ: The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT. In the diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood that the sample is 0/0, 0/1/, or 1/1 under the model built for the NGS dataset.
AD and DP: These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. See the Technical Documentation for details on AD (DepthPerAlleleBySample) and DP (Coverage).
PL: This field provides the likelihoods of the given genotypes (here, 0/0, 0/1, and 1/1). These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the heterozygous case, this is L(data given that the true genotype is 0/1). The most likely genotype (given in the GT field) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.
With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282.
chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
At this site, the called genotype is GT = 0/1, which is C/T. The confidence (GQ=25.92) isn't so good, largely because there were only a total of 4 reads at this site (DP=4), 1 of which was ref (=had the reference base) and 3 of which were alt (=had the alternate base) (AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value), whereas there's a serious chance that the subject is hom-var (=homozygous with the variant allele) since PL(1/1) = 26 = 10^(-2.6) = 0.25%. Either way, though, it's clear that the subject is definitely not home-ref (=homozygous with the reference allele) here since PL(0/0) = 103 = 10^(-10.3) which is a very small number.
Finally, variants in a VCF can be annotated with a variety of additional tags, either by the built-in tools or with others that you add yourself. The way they're formatted is similar to what we saw in the Genotype fields, except instead of being in two separate fields (tags and values, respectively) the annotation tags and values are grouped together, so tag-value pairs are written one after another.
chr1 873762 [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473
chr1 877664 [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185
chr1 899282 [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148
Here are some commonly used built-in annotations and what they mean:
| Annotation tag in VCF | Meaning |
|---|---|
| AC,AF,AN | See the Technical Documentation for Chromosome Counts. |
| DB | If present, then the variant is in dbSNP. |
| DP | See the Technical Documentation for Coverage. |
| DS | Were any of the samples downsampled because of too much coverage? |
| Dels | See the Technical Documentation for SpanningDeletions. |
| MQ and MQ0 | See the Technical Documentation for RMS Mapping Quality and Mapping Quality Zero. |
| BaseQualityRankSumTest | See the Technical Documentation for Base Quality Rank Sum Test. |
| MappingQualityRankSumTest | See the Technical Documentation for Mapping Quality Rank Sum Test. |
| ReadPosRankSumTest | See the Technical Documentation for Read Position Rank Sum Test. |
| HRun | See the Technical Documentation for Homopolymer Run. |
| HaplotypeScore | See the Technical Documentation for Haplotype Score. |
| QD | See the Technical Documentation for Qual By Depth. |
| VQSLOD | Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model. |
| FS | See the Technical Documentation for Fisher Strand |
| SB | How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls). |
Geraldine Van der Auwera, PhD
Comments
I have ABHet and ABHom in my vcf file and I can not find information about what they stand for. Could anyone explain or give some pointers. Thanks.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •If you look at the VCF header it gives the definitions there (e.g. ABHet = Allele Balance for hets (ref/(ref+alt))). Unfortunately, we haven't done a great job documenting non-standard annotations.
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I'm a little confused about the QUAL field for multi-sample VCF. When there are multiple samples, how is the single QUAL value calculated? What exactly does that single value represent in this case?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
In the method article Using the Unified Genotyper, there is a link to a document that explains the math of the EXACT model, i.e. how the QUAL value is calulated.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
I am a bit confused about the genotype column described here and in the linked VCF documentation. The explanation of the GQ is different. Here you say: "The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT."
And in the VCF version 4 document it is explained like this: "GQ genotype quality, encoded as a phred quality -10log_10p(genotype call is wrong) (Numeric)"
Which one is right now? From the examples given it seems that the second one is right... Or is there a misinterpretation of your explanation from my side?
Thanks
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •These are the same things. The VCF definition is the mathematically correct one, while ours is written in "layman's terms."
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi, I have vcf files, but I'm unsure how to determine the genotype of each SNP and I can't find any guide to how to use GQ and PL to obtain the correct genotypes (whether a SNP is truly heterozygous or homozygous). Can someone point me to some function or method as to how to do this? Thanks
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •All that is answered by point 4 of this document.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Oh, I got confused by the example for NA12878 at chr1:899282. I thought from the explanation we still needed to set GQ and PL thresholds to determine exact genotypes. The reason I asked is because I have some really low GQ (some as low as 0) and was wondering if I should filter, and also what threshold to set it at.
Thks for your help!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Can you help me understand the first example in section 4 above?
0/1:173,141:282:99:255,0,255
Why is there high-confidence (99) in a 0/1 genotype in the GQ field but low confidence (0) in the PL field for 0/1?
Thanks!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Matt, you are misunderstanding the PL field. The key sentence is here:
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hello,
In the example above (which is VCF4.0), it is noted that the PL field is only provided when the site is biallelic. In the specification for VCF4.1, the GL and PL fields are defined for multiallelic sites. Can GATK currently handle these?
If multiallelic sites are supported, is it possible to merge genotypes (using CombineVariants) from different VCF files that have alternate alleles listed in a different order (or even completely different alternate alleles listed). In such cases will the PL field be updated correctly? Or does it require that in each file being merged have exactly the same ALT alleles listed in the same order?
Thanks, ~M
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks, Melissa. I removed the line about the PLs not being produced for sites that aren't bi-allelic (that was from a long, long time ago).
So, yes, we generally do support multi-allelic sites. But not in CombineVariants; when combining records that do not have the same exact alt alleles, we strip out the PLs. To regenerate them you really need to go back to the BAMs.
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •The QUAL definition here describes it as "probability of Probability of REF/ALT polymorphism". Shouldnt it be just "Probability of REF/ALT polymorphism"? Or am I understanding something wrong?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Yeah, that looks like a copy/paste-induced typo, thanks for pointing it out. Will fix.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •The last part of section 4 is unclear to me because it looks like probabilities are calculated differently for each genotype. Let me explain:
According to the text, there is a "lack of certainty" in the genotype call because: PL(0/1) = 0, from which it is inferred that there is a "lack of certainty" of this genotype being true. PL(1/1) = 26 = 10^(-2.6) = 0.25%, from which it is inferred that "there's a serious chance that the subject is hom-var" PL(0/0) = 103 = 10^(-10.3) = very small number, from which it is inferred that "the subject is definitely not home-ref".
But according to me: PL(0/1) = 10^(0) = 100% , from which I infer that the heterozygous reference/alternate genotype IS VERY LIKELY. PL(1/1) = 26 = 10^(-2.6) = 0.25%, from which I infer that it is highly unlikely that it is the homozygous alternate genotype. PL(0/0) = 103 = 10^(-10.3) = very small number, from which I infer that it is highly unlikely that it is the homozygous reference genotype.
Am I reading this wrong? Can you define genotypes in terms terms of reference / alternate alleles instead of " hom-var", "home-ref" ? and also define a "lack of certainty"?
Thanks a lot,
sebastien
With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282. chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26 At this site, the called genotype is GT = 0/1, which is C/T. The confidence (GQ=25.92) isn't so good, largely because there were only a total of 4 reads at this site (DP=4), 1 of which was ref (=had the reference base) and 3 of which were alt (=had the alternate base) (AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value), whereas there's a serious chance that the subject is hom-var (=homozygous with the variant allele) since PL(1/1) = 26 = 10^(-2.6) = 0.25%. Either way, though, it's clear that the subject is definitely not home-ref (=homozygous with the reference allele) here since PL(0/0) = 103 = 10^(-10.3) which is a very small number.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I thought the DP in INFO field means the total coverage. Why in the given example the AD in FORMAT field doesn't add up to the DP in INFO field? Anyone knows? Thank you!!
chr1 873762 . T G 5231.78 PASS AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •We've explained this again and again in this forum. Look up the meaning of the DP and AD annotations in the Technical Documentation, and the reason should become clear...
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine,
I got a vcf file from a collaborator from Broad and I am confused on the sample filed: for all the Homozygous Variants SNPs the fiels is like ==> 1/1:0,1:127:99:3262,381,0 and for all the Homozygous Reference SNPS ==> 0/0:1,0:127:99:0,379,2880
I am wondering why the allelic count is not equal to the total count ? as in the following case:
0/1:73,78:151:99:1545,0,1003.
Any Help is highly appreciated.
Thanks, Rajesh
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •DP is filtered, AD is unfiltered.
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hello,
What does it mean when the Alt field has multiple alleles separated by comma?
For example:
chr1 201333017 . CT CTT,C 7591.42 . AC=1,1;AF=0.50,0.50;AN=2;BaseQRankSum=4.656;DP=1528;FS=1.583;HaplotypeScore=9115.9044;MQ=56.76;MQ0=2;MQRankSum=-0.362;QD=4.97;ReadPosRankSum=-0.709;SB=-2050.23 GT:AD:DP:GQ:PL 1/2:889,0,10:1522:99:7591,3624,7034,3292,0,11050
Does it mean that at this locus, both alleles are possible? Would this only happen when there are indels?
thanks, Sharon
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Sharon,
This means that the program observed two possible alternate alleles in the data. This can happen for any diploid organism; both chromosomes can have different variants that are themselves different from the "normal" reference.
Please see the VCF specification for more details on how to interpret VCF files.
http://www.1000genomes.org/wiki/Analysis/Variant Call Format/vcf-variant-call-format-version-41
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Thanks a lot Geraldine!
Best regards, Sharon
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •