What is a VCF and how should I interpret it?

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin
edited July 29 in FAQs

This document describes "regular" VCF files. For information on the special kind of VCF called gVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document.


Contents

  1. What is VCF?
  2. Basic structure of a VCF file
  3. Interpreting the VCF file header information
  4. Structure of variant call records
  5. How the genotype and other sample-level information is represented
  6. How to extract information from a VCF in a sane, straightforward way

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

  • Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

  • NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

  • Don't write home-brewed VCF parsing scripts. It never ends well.


2. Basic structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

image

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also include the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
1   873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1   974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.


3. Interpreting the VCF file header information

The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4
.
.
.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta

We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

  • VCF spec version

The first line:

##fileformat=VCFv4.1

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

  • FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in its FILTER field (see how records are structured further below).

  • FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.

  • GATKCommandLine

The GATKCommandLine lines contain all the parameters that went used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.

  • Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

[todo: FAQ on genome builds]


4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie . to serve as a placeholder).

  • CHROM and POS : The contig and genomic coordinates on which the variant occurs.
    Note that for deletions the position given is actually the base preceding the event.

  • ID: An optional identifier for the variant.
    Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP.

  • REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated).
    Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

  • QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data.
    Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic.
    Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

  • FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters.
    If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

This next field does not have to be present in the VCF.

  • INFO: Various site-level annotations.
    The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by colons, ie ; as in this example: MQ=99.00;MQ0=0;QD=17.94.
    They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.


5. How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

1   873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

Looking at that last column, here is what the tags mean:

  • GT : The genotype of this sample at this site.
    For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate
      In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
      For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.
  • AD and DP : Allele depth and depth of coverage.
    These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
    AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
    DP is the filtered depth, at the sample level. This gives you the number of filtered reads that support each of the reported alleles. You can check the variant caller’s documentation to see which filters are applied by default. Only reads that passed the variant caller’s filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.
    See the Tool Documentation for more details on AD (DepthPerAlleleBySample) and DP (Coverage)for more details.

  • PL : Normalized Phred-scaled likelihoods of the possible genotypes.
    For the typical case of a monomorphic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are normalized so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale (meaning its P = 1.0 in regular scale). The other values are scaled relative to this most likely genotype.
    Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "likelihood of the genotype", we mean it is "the probability that the genotype is not correct". That's why the smaller the value, the better it is.
    [todo: PL details doc]

  • GQ : Quality of the assigned genotype.
    The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.
    Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.
    Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP =4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but be may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.


6. How to extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)

basic-vcf-format.png
288 x 287 - 45K
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Geraldine Van der Auwera, PhD

  • HeidihuangHeidihuang ChinaPosts: 18Member

    Hello,
    Is there any information about non-reference counts? In my vcf file, the genotype field has 9 subfields including
    GT:AB AD:DOS:DP:GQ:NALT:NALTT:PL. I am confused by NALT and NALTT. Is NALT will equal to the total numbers of alt alleles in AD field? Or are there any other ideas to get the non-reference counts for every site of every sample?
    Thanks a lot!

    The description of FORMAT field in my vcf file is as follows.

    GT:AB:AD:DOS:DP:GQ:NALT:NALTT:PL

    FORMAT=<ID=AB,Number=1,Type=Float,Description="Allele balance for each het genotype">

    FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

    FORMAT=<ID=DOS,Number=1,Type=Float,Description="Dosage from likelihoods">

    FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">

    FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

    FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

    FORMAT=<ID=NALT,Number=1,Type=Integer,Description="Number of nonreference alleles for this individual">

    FORMAT=<ID=NALTT,Number=1,Type=Integer,Description="Number of nonreference alleles for this individual; thresholded">

    FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

  • HeidihuangHeidihuang ChinaPosts: 18Member

    @Geraldine_VdAuwera‌
    Sorry for asking silly questions. In FORMAT field, AD corresponds to non-filtered counts, and DP is filtered. I was wondering which one is corresponding non-reference counts for DP in FORMAT field? If I want one corresponding non-reference counts and depth, no matter it is filtered or not, how can I extract them? Thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    @Heidihuang‌ If you want to get the count of all reads that do not have the reference allele, you should take the sum of the AD elements minus the first AD element (which is the reference depth).

    Geraldine Van der Auwera, PhD

  • HasaniHasani GermanyPosts: 27Member
    edited August 2014

    Over 70% of the links in this page are broken..could you please re-point to the right page?

    Broken links are:

    • AD (DepthPerAlleleBySample) and DP (Coverage).

    • AC,AF,AN -> Chromosome Counts.

    • DP -> Coverage.

    • Dels -> SpanningDeletions.

    • MQ and MQ0 -> RMS Mapping Quality and Mapping Quality Zero.

    • BaseQualityRankSumTest -> Base Quality Rank Sum Test.

    • MappingQualityRankSumTest -> Mapping Quality Rank Sum Test.

    • ReadPosRankSumTest -> Read Position Rank Sum Test.

    • HRun -> Homopolymer Run.

    • HaplotypeScore -> Haplotype Score.

    • QD -> Qual By Depth.

    • FS -> Fisher Strand

    Thank you in advance!

    Post edited by Hasani on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Sure, will do. In the meantime you can find those doc pages by going to Guide > Tool Documentation > Annotations

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 36Member

    Hi Geraldine! Recently I used GATK HaplotypeCaller to call SNPs, but I didn't quite understand the output. I used HaplotypeCaller to generate g.vcf files and then genotype these g.vcf files together. Below is a few examples:

    The first is :
    chr1 231884 rs316190603 C T 513.90 PASS AC=2;AF=1.00;AN=2;DB;DP=37;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.62 GT:AD:DP:GQ:PL 1/1:0,12:12:36:540,36,0 ./.:.:5 ./.:.:20
    In this example, I wonder why GATK gives ./.:.:5 and ./.:.:20, assigning no genotypes and allele depth (AD) for these two samples.I examined the read mapping, and found that both the second and the third sample have reads mapped at this site (see the attatchment).

    The second example is :
    chr3 98144609 . A G 318.39 PASS AC=4;AF=0.667;AN=6;DP=0;FS=0.000;MLEAC=4;MLEAF=0.667;MQ=0.00;MQ0=0 GT:GQ:PL 1/1:3:133,3,0 0/1:23:38,0,23 0/1:3:175,0,3
    In this example, why the filed "DP" is 0, and GATK still assign genotypes for the samples?

    Thanks!

    read mapping.png
    1366 x 633 - 16K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Hi @Jack,

    For your first example, can you post the lines from the individual sample GVCFs? you should use the -bamout argument to see what the region looks like after reassembly. It's possible that the reassembly process has moved the reads somewhere else and that the site is no longer covered.

    For your second example, the DP annotation is the filtered depth, so it's possible that there are reads there, but they fail the HC's quality filters. When that happens, the HC will still try to genotype the samples, but you can see from the PL values that there is no real confidence about which genotype is correct.

    Geraldine Van der Auwera, PhD

  • naymikmnaymikm Posts: 6Member

    Hi I am trying to make sense of this low quality snp:
    chr1 1671866 rs7364986 A T,C 0.01 LowQual DB;DP=2;MQ0=0;NS=1 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS 1/1:28.5,NaN:0,2,0,0,0,0:.:.:.
    There are two ALT alleles but here the 1/1 indicates that only the T is being referred to, from my understanding. My question is why is the C ALT allele even listed? Is this just because it is such low quality the software is just noting that a C is just as likely here?

    Thanks,
    Marcus

  • SheilaSheila Broad InstitutePosts: 1,992Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @naymikm‌

    Hi Marcus,

    Can you tell me which tool you are using and which version of GATK you are using? Also, please post your exact command line.

    Thanks,
    Sheila

  • thalekothaleko OsloPosts: 1Member
    edited November 2014

    Hi,

    Hope this isn't too stupid a question - I am new to variant calling and have now gone through your RNA-seq VC workflow using my paired end illumina reads and b37 fasta (from your ftp server) as reference. The reads were mapped with the STAR 2-pass approach. I have now reached the variant calling/filtering step. My problem is: The VCF files that are produced only seem to give me variants from chromosome 20. (ie all entries have 20 in the #CHROM field).

    Is there a problem with the reference used? Or am I just interpreting the vcf file all wrong? (If so, where can I find the variants from the remaining 22 chromosomes....?)

    (Edit: I think I figured it out, the culprit is probably the -L 20 argument in the BQSR step... Sorry!)

    Post edited by thaleko on
  • SheilaSheila Broad InstitutePosts: 1,992Member, GATK Dev, Broadie, Moderator, DSDE Dev admin
    edited November 2014

    @thaleko‌

    Hi,

    Can you please post the exact commands you used? Did you use -L in your commands?

    If you used the -L because that is what we have in out tutorials, please note we only include the -L 20 in order to restrict analysis to a chromosome interval for demo purposes. For more details on using -L argument, you can refer here: https://www.broadinstitute.org/gatk/guide/article?id=4133

    -Sheila

    Post edited by Sheila on
  • jullflyjullfly CanadaPosts: 6Member

    Hi,
    I have not been able to find this information anywhere on this website or any of the websites for VCF file documentation, but I apologize if this information is given somewhere. I am wondering what order the ALT alleles (from VCF files made from HaplotypeCaller) are listed in: is ALT1 always the most frequent ALT allele and the other ALT alleles are listed in order of decreasing frequency?
    Clarification would be greatly appreciated.
    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    @jullfly It's a good question, I don't think this is discussed explicitly anywhere. The VCF spec does not impose any logic to the ordering of alternate alleles. I can't speak for other programs, but GATK itself does not incorporate any informative logic to the allele ordering -- I forget if it is alphabetical or some random process, but it does not imply anything meaningful about allele frequency.

    Geraldine Van der Auwera, PhD

  • jullflyjullfly CanadaPosts: 6Member

    @Geraldine_VdAuwera Thanks a lot for your prompt reply! It would be easier if ALT alleles were listed in order of decreasing frequency for me to do some hard filtering with JEXL expressions, but I should still be able to work around it.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    I see what you mean, but from a technical standpoint I can imagine a lot of things going wrong with that, so I don't think it's something we would want to build in. Maybe a separate annotation would do the trick. Now we just need someone to volunteer to write one ;)

    Geraldine Van der Auwera, PhD

  • geoffroyGATKgeoffroyGATK INSERMPosts: 2Member

    Hi,
    This is just for information. In the "What is VCF?" section, the link for VCF specification should be updated.
    The VCF specification is no longer maintained by the 1000 Genomes Project. The group leading the management and expansion of the format is the Global Alliance for Genomics and Health Data Working group file format team, http://ga4gh.org/#/fileformats-team

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Fair point (we are well acquainted with GA4GH); will update those docs accordingly.

    Geraldine Van der Auwera, PhD

  • slowsmileslowsmile New yorkPosts: 2Member

    This page is very helpful but I have a followup question. What is the order in AD and PL if the variant is multi-allelic? in my case, I have
    ref=G
    alt=GA,GAA,GAAA,A

    in one of the samples I have
    GT=1/2,
    AD=7 0 6 27 19 0 0 0 0 0
    PL= 1111 216 215 351 0 516 704 248 419 709 704 248 419 709 709

    What is the order of AD or PL values corresponding to each of the allele case?

  • pdexheimerpdexheimer Posts: 491Member, GATK Dev, DSDE Dev mod

    @slowsmile - AD is pretty straightforward, I think it even says in the header. It's ref allele first, followed by the alt alleles in the same order as the ALT column. This doesn't match what you have in your example, though - you only have five alleles, but you show 10 values for AD. Not sure what's going on there.

    PL is far more complex. From the VCF spec: "If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."

    So in your example, the ordering will be 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, 0/3, 1/3, 2/3, 3/3, 0/4, 1/4, 2/4, 3/4, 4/4

  • SheilaSheila Broad InstitutePosts: 1,992Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @slowsmile
    Hi,

    Sorry, I just deleted my previous post. I made a mistake, but @pdexheimer answered you correctly above.

    -Sheila

  • slowsmileslowsmile New yorkPosts: 2Member

    @Sheila and @pdexheimer , Thanks both of you. Your answer is very prompt and accurate. It helps me a lot. Best

  • yoyorollsyoyorolls VancouverPosts: 5Member

    Hi,

    I'm new to GATK and I just tried out your best practice for getting variants from RNA-seq data for one human sample. Maybe I'm not understanding the vcf format correctly, but all my variants have either AC= 2, or 1; AF= 1, or 0.5, and all AN = 2. But the sample DP clearly has variants with DP >2. I don't understand why AC and AF info doesn't match up with sample DP. I thought Allelic Count + ref count = sample DP. Also, I don't believe all the varants have allelic frequency of either 1 or 0.5, seeing that it's RNA-seq.

    I followed through the protocol, and did the base recalibration as well. The only difference I can think of is that for base recalibration, I used bed file for snp instead of vcf, and because the program didn't like the way bed formats its position (eg. insertions would have the same start and end sites, entries with start site lower than the end site of previous entry etc), I changed it so that the end site is always start +1 bp. I also didn't do the optional indel realignment. I did notice, that after marking duplicates, it said >80% of reads were filtered... Could that be the problem?

    (RNA-seq was done with ion torrent, and ~30M reads. When mapping using STAR, I used snp-masked genome. In GATK, I used regular unmasked genome)
    Thanks!

  • pdexheimerpdexheimer Posts: 491Member, GATK Dev, DSDE Dev mod

    @yoyorolls -

    "Allele Count" (AC) refers to the number of called alleles, not to the number of reads supporting an allele. With a single diploid sample, you're only calling two alleles, and either one or both of them will be variant.

    The number of reads supporting each allele are output in the FORMAT-level AD annotation

  • yoyorollsyoyorolls VancouverPosts: 5Member

    Ok, thanks!

    Also, from previous discussions, it seems that AD is unfiltered while sample-DP is filtered. Is there a way to get the filtered number of reads supporting each allele? Or is it common practice to use the AD numbers to calculate allele frequency?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    We don't currently have official recommendations for calculating allele frequency in RNAseq, but basically, in the current state you would indeed use AD.

    Geraldine Van der Auwera, PhD

  • yoyorollsyoyorolls VancouverPosts: 5Member

    Alright~ thanks!

  • solssols saols, manipalPosts: 3Member

    I have called variants for our RNA-Seq data following the best practice as mentioned. I am unable to interpret some of the entries in the vcf files. I am pasting the examples below.

    chr1 1288459 . TG T 283.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.064;ClippingRankSum=-0.268;DP=50;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=35.71;MQ0=0;MQRankSum=0.268;QD=5.67;ReadPosRankSum=-0.755 GT:AD:DP:GQ:PL 0/1:25,23:48:99:321,0,344

    Here, it states that the variant is heterozygous with the DP 48 and ref count AD 25 and alt count AD 23. Quality looks good and the confidence GQ is also very high.

    Why are two nucleotides mentioned in REF column? How is it heterozygous if its T in REF and T in ALT columns?

    I also find the similar entries with more than two nucleotides in the REF column.

    chr1 160209736 . AAGG A 591.52 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.804;ClippingRankSum=-0.291;DP=44;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=36.53;MQ0=0;MQRankSum=-0.222;QD=13.44;ReadPosRankSum=3.505 GT:AD:DP:GQ:PL 0/1:6,38:44:7:628,0,7

    Lastly, the same behavior is also observed the other way around.

    chr1 1153898 . C CT 66.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.787;ClippingRankSum=-0.208;DP=77;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=45.50;MQ0=0;MQRankSum=1.633;QD=0.87;ReadPosRankSum=-0.659 GT:AD:DP:GQ:PL 0/1:50,23:73:99:104,0,547

    Can you please help me interpret it.

    Thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Hi @sols, those are simply insertions and deletions (indels). The VCF format specification has some helpful explanations for interpreting these records.

    Geraldine Van der Auwera, PhD

  • SheilaSheila Broad InstitutePosts: 1,992Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @sols
    Hi,

    The first two examples you posted above are of deletions at the site. The third example you posted is of an insertion at the site.

    In your first example, the reference has TG at position 1288459, but the sample has an alternate allele of T, meaning the G has been deleted.

    In the last example, the reference has a C, but the sample has an alternate allele of CT, meaning there is a T insertion.

    I hope this helps.

    -Sheila

  • solssols saols, manipalPosts: 3Member

    Thank you. I got it

  • MaguelonneMaguelonne ParisPosts: 7Member

    Hi,

    To have full information, I added the "StrandBiasBySample" option for the calling. Since the "StrandBiasBySample" documentation specifies that it gives the "numbers of forward and reverse reads that support REF and ALT alleles", I'm wondering how to deal with a multi-allelic variant. To be sure, if a sample's genotype is 1/2, do the "DP1,DP2,DP3,DP4" given by this option correspond to the numbers of forward and reverse reads that support alternative alleles "1" and "2"?

    Thanks

  • SheilaSheila Broad InstitutePosts: 1,992Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @Maguelonne
    Hi,

    Unfortunately, StrandBiasBySample only gives the ref-forward, ref-reverse, first alt-forward, first alt-reverse. So, it does not give the two alt alleles.

    The better annotation to use is StrandAlleleCountsBySample. It gives you the forward and reverse counts of all alleles (ref and all alternate alleles) in order.

    -Sheila

  • SV_487SV_487 Washington, D.C.Posts: 1Member

    Hi,

    VCF can be especially unwieldy and inefficient when dealing, for example, with rare variants, or when conducting a variation discovery (i.e. non-genotyping) study. Consider a study with thousands of samples, in which most variants are either rare (seen in only one or two samples) or are defined by just one observation because they are being discovered for the first time. VCF files for these data consist mostly of thousands upon thousands of null genotypes, drastically expanding storage requirements while providing no added informational benefit.

    Instead of using the genotype column format / approach, it would seem to make sense in this case to use INFO tags to list the one (or perhaps two) samples in which a variant was seen, and dispense with the genotype columns. But I don't see provisions for this in the VCF spec.

    What approach would you use if you had to deal primarily with this type of data in VCF? Is there an alternative format or tool of which I am not aware?

    Thank you

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    @SV_487 Currently we're just grinning and bearing it, e.g. for the ExAC dataset the VCF file is just ridiculous, but we don't have a better solution yet. Working on developing something better but it will be a while before that's ready to even discuss (it's exploratory work).

    Geraldine Van der Auwera, PhD

  • freyesfreyes MexicoPosts: 2Member

    Hi, regards from Mexico,

    About VCF text in '2. Basic structure of a VCF file' you explain that first two variants show 'very high confidence' and third variant show 'relative low confidence'. Please tell me are there specific numeric values that constitute the boundaries between 'low confidence' 'relative low confidence', 'confidence', 'very high confidence', etc.?, how could I define these boundaries as objectively as possible? or do these labels are totally subjective?

    Thank you very much in advance for your kind answer.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    @freyes These are subjective judgments based on experience. It's very difficult to define objective boundaries because it depends a lot on the dataset and experiment, and some semi-random technical aspects. That's why we use some filtering tools (see VQSR, variant recalibration in the Best Practices documentation) that use machine learning for filtering variants in a way that balances the tradeoff between sensitivity and specificity.

    Geraldine Van der Auwera, PhD

  • freyesfreyes MexicoPosts: 2Member

    Thank you very much for your answer Geraldine!, you are very kind!

  • gideonitegideonite Ithaca, NYPosts: 1Member

    The links are still broken on this page.

    Issue · Github
    by Sheila

    Issue Number
    73
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstitutePosts: 1,992Member, GATK Dev, Broadie, Moderator, DSDE Dev admin

    @gideonite
    Hi,

    Sorry about that. I will try to get them fixed soon. In the meantime, you can go here: https://www.broadinstitute.org/gatk/guide/tooldocs/ and look under Variant Annotations for the annotation explanations.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Tool docs links have been fixed.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,279Administrator, GATK Dev admin

    Also, doc has been updated.

    Geraldine Van der Auwera, PhD

  • vsvintivsvinti Posts: 54Member
    edited August 27

    Hi there, I am using version 3.4-46-gbc02625 to do joint genotyping with HC. My resulting vcf files contain * in the ALT column, which other software programs don't like. Is this a new thing, and what does it mean? (they show up in the snps file when filtered on variant type)

    eg.

    1       789256  rs3131939       T       C,*     17024.46        PASS    AC=8,2;AF=0.400,0.100;AN=20;BaseQRankSum=2.40;ClippingRankSum=0.102;DB;DP=187751;FS=18.346;InbreedingCoeff=-0.3372;MLEAC=9,2;MLEAF=0.450,0.100;MQ=42.05;MQRankSum=4.61;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=25.45;ReadPosRankSum=0.714;SOR=1.113;VQSLOD=-4.416e-01;culprit=MQRankSum
    
    Post edited by vsvinti on
  • ZaagZaag Posts: 14Member

    @vsvinti said:
    Hi there, I am using version 3.4-46-gbc02625 to do joint genotyping with HC. My resulting vcf files contain * in the ALT column, which other software programs don't like. Is this a new thing, and what does it mean? (they show up in the snps file when filtered on variant type)

    eg.

    > 1       789256  rs3131939       T       C,*     17024.46        PASS    AC=8,2;AF=0.400,0.100;AN=20;BaseQRankSum=2.40;ClippingRankSum=0.102;DB;DP=187751;FS=18.346;InbreedingCoeff=-0.3372;MLEAC=9,2;MLEAF=0.450,0.100;MQ=42.05;MQRankSum=4.61;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=25.45;ReadPosRankSum=0.714;SOR=1.113;VQSLOD=-4.416e-01;culprit=MQRankSum
    > 

    I believe it means the SNP lives underneath a deletion from another sample in the vcf.

Sign In or Register to comment.