The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# What is a VCF and how should I interpret it?

edited July 2016 in FAQs

#### Contents

1. What is VCF?
2. Basic structure of a VCF file
3. Interpreting the VCF file header information
4. Structure of variant call records
5. How the genotype and other sample-level information is represented
6. How to extract information from a VCF in a sane, straightforward way

### 1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

• Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

• NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned

• Don't write home-brewed VCF parsing scripts. It never ends well.

### 2. Basic structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also include the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
1   873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1   974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255


After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.

### 3. Interpreting the VCF file header information

The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4
.
.
.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta


We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

• VCF spec version

The first line:

##fileformat=VCFv4.1


tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

• FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">


Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in its FILTER field (see how records are structured further below).

• FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.

• GATKCommandLine

The GATKCommandLine lines contain all the parameters that went used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.

• Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

[todo: FAQ on genome builds]

### 4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]


The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

#### Site-level properties and annotations

These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie . to serve as a placeholder).

• CHROM and POS : The contig and genomic coordinates on which the variant occurs.
Note that for deletions the position given is actually the base preceding the event.

• ID: An optional identifier for the variant.
Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP.

• REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated).
Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

• QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data.
Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic.
Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

• FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters.
If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

This next field does not have to be present in the VCF.

• INFO: Various site-level annotations.
The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by colons, ie ; as in this example: MQ=99.00;MQ0=0;QD=17.94.
They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

#### Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]


All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.

### 5. How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

1   873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26


Looking at that last column, here is what the tags mean:

• GT : The genotype of this sample at this site.
For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

• 0/0 - the sample is homozygous reference
• 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
• 1/1 - the sample is homozygous alternate
In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.
• AD and DP : Allele depth and depth of coverage.
These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
DP is the filtered depth, at the sample level. This gives you the number of filtered reads that support each of the reported alleles. You can check the variant caller’s documentation to see which filters are applied by default. Only reads that passed the variant caller’s filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.
See the Tool Documentation for more details on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.

• PL : Normalized Phred-scaled likelihoods of the possible genotypes.
For the typical case of a monomorphic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are normalized so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale (meaning its P = 1.0 in regular scale). The other values are scaled relative to this most likely genotype.
Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "likelihood of the genotype", we mean it is "the probability that the genotype is not correct". That's why the smaller the value, the better it is.
[todo: PL details doc]

• GQ : Quality of the assigned genotype.
The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.
Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.
Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26


At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP =4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but be may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.

### 6. How to extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)

Geraldine Van der Auwera, PhD

Post edited by Geraldine_VdAuwera on
Tagged:

Geraldine Van der Auwera, PhD

• ChinaMember Posts: 18

Hello,
Is there any information about non-reference counts? In my vcf file, the genotype field has 9 subfields including
GT:AB AD:DOS:DP:GQ:NALT:NALTT:PL. I am confused by NALT and NALTT. Is NALT will equal to the total numbers of alt alleles in AD field? Or are there any other ideas to get the non-reference counts for every site of every sample?
Thanks a lot!

The description of FORMAT field in my vcf file is as follows.

## FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

• ChinaMember Posts: 18

@Geraldine_VdAuwera‌
Sorry for asking silly questions. In FORMAT field, AD corresponds to non-filtered counts, and DP is filtered. I was wondering which one is corresponding non-reference counts for DP in FORMAT field? If I want one corresponding non-reference counts and depth, no matter it is filtered or not, how can I extract them? Thanks a lot!

@Heidihuang‌ If you want to get the count of all reads that do not have the reference allele, you should take the sum of the AD elements minus the first AD element (which is the reference depth).

Geraldine Van der Auwera, PhD

• GermanyMember Posts: 30
edited August 2014

Over 70% of the links in this page are broken..could you please re-point to the right page?

• AD (DepthPerAlleleBySample) and DP (Coverage).

• AC,AF,AN -> Chromosome Counts.

• DP -> Coverage.

• Dels -> SpanningDeletions.

• MQ and MQ0 -> RMS Mapping Quality and Mapping Quality Zero.

• BaseQualityRankSumTest -> Base Quality Rank Sum Test.

• MappingQualityRankSumTest -> Mapping Quality Rank Sum Test.

• ReadPosRankSumTest -> Read Position Rank Sum Test.

• HRun -> Homopolymer Run.

• HaplotypeScore -> Haplotype Score.

• QD -> Qual By Depth.

• FS -> Fisher Strand

Thank you in advance!

Sure, will do. In the meantime you can find those doc pages by going to Guide > Tool Documentation > Annotations

Geraldine Van der Auwera, PhD

• Member Posts: 36

Hi Geraldine! Recently I used GATK HaplotypeCaller to call SNPs, but I didn't quite understand the output. I used HaplotypeCaller to generate g.vcf files and then genotype these g.vcf files together. Below is a few examples:

The first is :
chr1 231884 rs316190603 C T 513.90 PASS AC=2;AF=1.00;AN=2;DB;DP=37;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.62 GT:AD:DP:GQ:PL 1/1:0,12:12:36:540,36,0 ./.:.:5 ./.:.:20
In this example, I wonder why GATK gives ./.:.:5 and ./.:.:20, assigning no genotypes and allele depth (AD) for these two samples.I examined the read mapping, and found that both the second and the third sample have reads mapped at this site (see the attatchment).

The second example is :
chr3 98144609 . A G 318.39 PASS AC=4;AF=0.667;AN=6;DP=0;FS=0.000;MLEAC=4;MLEAF=0.667;MQ=0.00;MQ0=0 GT:GQ:PL 1/1:3:133,3,0 0/1:23:38,0,23 0/1:3:175,0,3
In this example, why the filed "DP" is 0, and GATK still assign genotypes for the samples?

Thanks!

Hi @Jack,

For your first example, can you post the lines from the individual sample GVCFs? you should use the -bamout argument to see what the region looks like after reassembly. It's possible that the reassembly process has moved the reads somewhere else and that the site is no longer covered.

For your second example, the DP annotation is the filtered depth, so it's possible that there are reads there, but they fail the HC's quality filters. When that happens, the HC will still try to genotype the samples, but you can see from the PL values that there is no real confidence about which genotype is correct.

Geraldine Van der Auwera, PhD

• Member Posts: 6

Hi I am trying to make sense of this low quality snp:
chr1 1671866 rs7364986 A T,C 0.01 LowQual DB;DP=2;MQ0=0;NS=1 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS 1/1:28.5,NaN:0,2,0,0,0,0:.:.:.
There are two ALT alleles but here the 1/1 indicates that only the T is being referred to, from my understanding. My question is why is the C ALT allele even listed? Is this just because it is such low quality the software is just noting that a C is just as likely here?

Thanks,
Marcus

@naymikm‌

Hi Marcus,

Can you tell me which tool you are using and which version of GATK you are using? Also, please post your exact command line.

Thanks,
Sheila

• OsloMember Posts: 1
edited November 2014

Hi,

Hope this isn't too stupid a question - I am new to variant calling and have now gone through your RNA-seq VC workflow using my paired end illumina reads and b37 fasta (from your ftp server) as reference. The reads were mapped with the STAR 2-pass approach. I have now reached the variant calling/filtering step. My problem is: The VCF files that are produced only seem to give me variants from chromosome 20. (ie all entries have 20 in the #CHROM field).

Is there a problem with the reference used? Or am I just interpreting the vcf file all wrong? (If so, where can I find the variants from the remaining 22 chromosomes....?)

(Edit: I think I figured it out, the culprit is probably the -L 20 argument in the BQSR step... Sorry!)

edited November 2014

@thaleko‌

Hi,

Can you please post the exact commands you used? Did you use -L in your commands?

If you used the -L because that is what we have in out tutorials, please note we only include the -L 20 in order to restrict analysis to a chromosome interval for demo purposes. For more details on using -L argument, you can refer here: https://www.broadinstitute.org/gatk/guide/article?id=4133

-Sheila

• CanadaMember Posts: 6

Hi,
I have not been able to find this information anywhere on this website or any of the websites for VCF file documentation, but I apologize if this information is given somewhere. I am wondering what order the ALT alleles (from VCF files made from HaplotypeCaller) are listed in: is ALT1 always the most frequent ALT allele and the other ALT alleles are listed in order of decreasing frequency?
Clarification would be greatly appreciated.
Thanks

@jullfly It's a good question, I don't think this is discussed explicitly anywhere. The VCF spec does not impose any logic to the ordering of alternate alleles. I can't speak for other programs, but GATK itself does not incorporate any informative logic to the allele ordering -- I forget if it is alphabetical or some random process, but it does not imply anything meaningful about allele frequency.

Geraldine Van der Auwera, PhD

• CanadaMember Posts: 6

@Geraldine_VdAuwera Thanks a lot for your prompt reply! It would be easier if ALT alleles were listed in order of decreasing frequency for me to do some hard filtering with JEXL expressions, but I should still be able to work around it.

I see what you mean, but from a technical standpoint I can imagine a lot of things going wrong with that, so I don't think it's something we would want to build in. Maybe a separate annotation would do the trick. Now we just need someone to volunteer to write one

Geraldine Van der Auwera, PhD

• INSERMMember Posts: 2

Hi,
This is just for information. In the "What is VCF?" section, the link for VCF specification should be updated.
The VCF specification is no longer maintained by the 1000 Genomes Project. The group leading the management and expansion of the format is the Global Alliance for Genomics and Health Data Working group file format team, http://ga4gh.org/#/fileformats-team

Fair point (we are well acquainted with GA4GH); will update those docs accordingly.

Geraldine Van der Auwera, PhD

• New yorkMember Posts: 2

This page is very helpful but I have a followup question. What is the order in AD and PL if the variant is multi-allelic? in my case, I have
ref=G
alt=GA,GAA,GAAA,A

in one of the samples I have
GT=1/2,
AD=7 0 6 27 19 0 0 0 0 0
PL= 1111 216 215 351 0 516 704 248 419 709 704 248 419 709 709

What is the order of AD or PL values corresponding to each of the allele case?

• Member, Dev Posts: 544 ✭✭✭✭

@slowsmile - AD is pretty straightforward, I think it even says in the header. It's ref allele first, followed by the alt alleles in the same order as the ALT column. This doesn't match what you have in your example, though - you only have five alleles, but you show 10 values for AD. Not sure what's going on there.

PL is far more complex. From the VCF spec: "If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc."

So in your example, the ordering will be 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, 0/3, 1/3, 2/3, 3/3, 0/4, 1/4, 2/4, 3/4, 4/4

@slowsmile
Hi,

Sorry, I just deleted my previous post. I made a mistake, but @pdexheimer answered you correctly above.

-Sheila

• New yorkMember Posts: 2

@Sheila and @pdexheimer , Thanks both of you. Your answer is very prompt and accurate. It helps me a lot. Best

• VancouverMember Posts: 5

Hi,

I'm new to GATK and I just tried out your best practice for getting variants from RNA-seq data for one human sample. Maybe I'm not understanding the vcf format correctly, but all my variants have either AC= 2, or 1; AF= 1, or 0.5, and all AN = 2. But the sample DP clearly has variants with DP >2. I don't understand why AC and AF info doesn't match up with sample DP. I thought Allelic Count + ref count = sample DP. Also, I don't believe all the varants have allelic frequency of either 1 or 0.5, seeing that it's RNA-seq.

I followed through the protocol, and did the base recalibration as well. The only difference I can think of is that for base recalibration, I used bed file for snp instead of vcf, and because the program didn't like the way bed formats its position (eg. insertions would have the same start and end sites, entries with start site lower than the end site of previous entry etc), I changed it so that the end site is always start +1 bp. I also didn't do the optional indel realignment. I did notice, that after marking duplicates, it said >80% of reads were filtered... Could that be the problem?

(RNA-seq was done with ion torrent, and ~30M reads. When mapping using STAR, I used snp-masked genome. In GATK, I used regular unmasked genome)
Thanks!

• Member, Dev Posts: 544 ✭✭✭✭

"Allele Count" (AC) refers to the number of called alleles, not to the number of reads supporting an allele. With a single diploid sample, you're only calling two alleles, and either one or both of them will be variant.

The number of reads supporting each allele are output in the FORMAT-level AD annotation

• VancouverMember Posts: 5

Ok, thanks!

Also, from previous discussions, it seems that AD is unfiltered while sample-DP is filtered. Is there a way to get the filtered number of reads supporting each allele? Or is it common practice to use the AD numbers to calculate allele frequency?

We don't currently have official recommendations for calculating allele frequency in RNAseq, but basically, in the current state you would indeed use AD.

Geraldine Van der Auwera, PhD

• VancouverMember Posts: 5

Alright~ thanks!

• saols, manipalMember Posts: 3

I have called variants for our RNA-Seq data following the best practice as mentioned. I am unable to interpret some of the entries in the vcf files. I am pasting the examples below.

chr1 1288459 . TG T 283.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.064;ClippingRankSum=-0.268;DP=50;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=35.71;MQ0=0;MQRankSum=0.268;QD=5.67;ReadPosRankSum=-0.755 GT:AD:DP:GQ:PL 0/1:25,23:48:99:321,0,344

Here, it states that the variant is heterozygous with the DP 48 and ref count AD 25 and alt count AD 23. Quality looks good and the confidence GQ is also very high.

Why are two nucleotides mentioned in REF column? How is it heterozygous if its T in REF and T in ALT columns?

I also find the similar entries with more than two nucleotides in the REF column.

chr1 160209736 . AAGG A 591.52 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.804;ClippingRankSum=-0.291;DP=44;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=36.53;MQ0=0;MQRankSum=-0.222;QD=13.44;ReadPosRankSum=3.505 GT:AD:DP:GQ:PL 0/1:6,38:44:7:628,0,7

Lastly, the same behavior is also observed the other way around.

chr1 1153898 . C CT 66.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.787;ClippingRankSum=-0.208;DP=77;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=45.50;MQ0=0;MQRankSum=1.633;QD=0.87;ReadPosRankSum=-0.659 GT:AD:DP:GQ:PL 0/1:50,23:73:99:104,0,547

Thanks.

Hi @sols, those are simply insertions and deletions (indels). The VCF format specification has some helpful explanations for interpreting these records.

Geraldine Van der Auwera, PhD

@sols
Hi,

The first two examples you posted above are of deletions at the site. The third example you posted is of an insertion at the site.

In your first example, the reference has TG at position 1288459, but the sample has an alternate allele of T, meaning the G has been deleted.

In the last example, the reference has a C, but the sample has an alternate allele of CT, meaning there is a T insertion.

I hope this helps.

-Sheila

• saols, manipalMember Posts: 3

Thank you. I got it

• ParisMember Posts: 9

Hi,

To have full information, I added the "StrandBiasBySample" option for the calling. Since the "StrandBiasBySample" documentation specifies that it gives the "numbers of forward and reverse reads that support REF and ALT alleles", I'm wondering how to deal with a multi-allelic variant. To be sure, if a sample's genotype is 1/2, do the "DP1,DP2,DP3,DP4" given by this option correspond to the numbers of forward and reverse reads that support alternative alleles "1" and "2"?

Thanks

@Maguelonne
Hi,

Unfortunately, StrandBiasBySample only gives the ref-forward, ref-reverse, first alt-forward, first alt-reverse. So, it does not give the two alt alleles.

The better annotation to use is StrandAlleleCountsBySample. It gives you the forward and reverse counts of all alleles (ref and all alternate alleles) in order.

-Sheila

• Washington, D.C.Member Posts: 1

Hi,

VCF can be especially unwieldy and inefficient when dealing, for example, with rare variants, or when conducting a variation discovery (i.e. non-genotyping) study. Consider a study with thousands of samples, in which most variants are either rare (seen in only one or two samples) or are defined by just one observation because they are being discovered for the first time. VCF files for these data consist mostly of thousands upon thousands of null genotypes, drastically expanding storage requirements while providing no added informational benefit.

Instead of using the genotype column format / approach, it would seem to make sense in this case to use INFO tags to list the one (or perhaps two) samples in which a variant was seen, and dispense with the genotype columns. But I don't see provisions for this in the VCF spec.

What approach would you use if you had to deal primarily with this type of data in VCF? Is there an alternative format or tool of which I am not aware?

Thank you

@SV_487 Currently we're just grinning and bearing it, e.g. for the ExAC dataset the VCF file is just ridiculous, but we don't have a better solution yet. Working on developing something better but it will be a while before that's ready to even discuss (it's exploratory work).

Geraldine Van der Auwera, PhD

• MexicoMember Posts: 2

Hi, regards from Mexico,

About VCF text in '2. Basic structure of a VCF file' you explain that first two variants show 'very high confidence' and third variant show 'relative low confidence'. Please tell me are there specific numeric values that constitute the boundaries between 'low confidence' 'relative low confidence', 'confidence', 'very high confidence', etc.?, how could I define these boundaries as objectively as possible? or do these labels are totally subjective?

Thank you very much in advance for your kind answer.

@freyes These are subjective judgments based on experience. It's very difficult to define objective boundaries because it depends a lot on the dataset and experiment, and some semi-random technical aspects. That's why we use some filtering tools (see VQSR, variant recalibration in the Best Practices documentation) that use machine learning for filtering variants in a way that balances the tradeoff between sensitivity and specificity.

Geraldine Van der Auwera, PhD

• MexicoMember Posts: 2

Thank you very much for your answer Geraldine!, you are very kind!

• Ithaca, NYMember Posts: 1

#### Issue · Github July 2015 by Sheila

Issue Number
73
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

@gideonite
Hi,

Sorry about that. I will try to get them fixed soon. In the meantime, you can go here: https://www.broadinstitute.org/gatk/guide/tooldocs/ and look under Variant Annotations for the annotation explanations.

-Sheila

Tool docs links have been fixed.

Geraldine Van der Auwera, PhD

Also, doc has been updated.

Geraldine Van der Auwera, PhD

• Member Posts: 78
edited August 2015

Hi there, I am using version 3.4-46-gbc02625 to do joint genotyping with HC. My resulting vcf files contain * in the ALT column, which other software programs don't like. Is this a new thing, and what does it mean? (they show up in the snps file when filtered on variant type)

eg.

1       789256  rs3131939       T       C,*     17024.46        PASS    AC=8,2;AF=0.400,0.100;AN=20;BaseQRankSum=2.40;ClippingRankSum=0.102;DB;DP=187751;FS=18.346;InbreedingCoeff=-0.3372;MLEAC=9,2;MLEAF=0.450,0.100;MQ=42.05;MQRankSum=4.61;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=25.45;ReadPosRankSum=0.714;SOR=1.113;VQSLOD=-4.416e-01;culprit=MQRankSum

• Member Posts: 14

@vsvinti said:
Hi there, I am using version 3.4-46-gbc02625 to do joint genotyping with HC. My resulting vcf files contain * in the ALT column, which other software programs don't like. Is this a new thing, and what does it mean? (they show up in the snps file when filtered on variant type)

eg.

> 1       789256  rs3131939       T       C,*     17024.46        PASS    AC=8,2;AF=0.400,0.100;AN=20;BaseQRankSum=2.40;ClippingRankSum=0.102;DB;DP=187751;FS=18.346;InbreedingCoeff=-0.3372;MLEAC=9,2;MLEAF=0.450,0.100;MQ=42.05;MQRankSum=4.61;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=25.45;ReadPosRankSum=0.714;SOR=1.113;VQSLOD=-4.416e-01;culprit=MQRankSum
> 

I believe it means the SNP lives underneath a deletion from another sample in the vcf.

• KielMember Posts: 109 ✭✭

@Geraldine_VdAuwera

The link to variantsToTable seems to be broken. (Firefox 40.0.3)

Thanks for reporting it, @EADG -- I'll try to get this fixed asap.

Geraldine Van der Auwera, PhD

File was corrupted -- fixed now. Thanks for reporting.

Geraldine Van der Auwera, PhD

• MichiganMember Posts: 1

Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "likelihood of the genotype", we mean it is "the probability that the genotype is not correct". That's why the smaller the value, the better it is.

I think that is incorrect. PL values are the NEGATIVE of 10 times the log (base 10) of the genotype likelihood. Since its negative, small values imply higher probability. For e.g. PL of 0 means genotype likelihood of 1, while PL of 255 (minimum possible value) means almost 0. However, one must note that genotype likelihood GL (of say 0/1) does not mean probability of the genotype 0/1 (coz that would imply that the GLs should add up to 1, which they don't) but in fact it means probability of reads given that genotype. Since, the true genotype is something we never know, we treat it like an unknown parameter and look at probabilities of reads given genotypes and not the other way round.

@Santy_8128
Hi,

The PL is Phred-scaled. If you look at this article you can see the Phred-scaled score takes into account the error probability. So, usually if a Phred-scaled score is high, we are more likely to believe the outcome is correct. In the case of the PL, the lower the Phred-scaled score, the more likely we are to believe the genotype is correct. The reason we say that the PL reflects "the probability that the genotype is not correct" is because a Phred-scaled of 0 indicates an error probability of 1.

The PL of 0 for the most likely genotype makes it easy to find in the PL field, so that is why it was chosen. It is easier than having to check for the highest PL value.

-Sheila

• Beijing Institute of Genomics, CASMember Posts: 112
edited December 2015

Hi @Sheila
Under:
5. How the genotype and other sample-level information is represented >> GQ : Quality of the assigned genotype
I quote:
"The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). "

here how 10^(-2.6) or 0.0025 is calculated? from where -2.6 come from? i am sorry i don't know about GQ, PL value calculations

This shows how to work backward through the calculation to the value obtained from the genotyping engine. Say the genotyping found a probability of 0.0025, the phred scaling transformation produces the value of 26 as explained in the FAQ on Phred scaling.

Geraldine Van der Auwera, PhD

• GreensboroMember Posts: 79

It may be a silly question, but I am little confused with the concept of AC (allele count) and AN (allele number).
AC - is the number of allele called in a sample, and AN - is the total number of alleles in the called genotype.
**
So, for the portion of vcf file shown below;** for 1st position, shouldn't the AC=1 (number of allele called for this sample at this locus) since its genotype is 1/1 and AN = 2 (since its diploid there are 2 alleles in total at this locus).

and for 3rd position AC = 2 (since two alleles were called) and AN = 2 (there are 2 alleles in total) for genotype 0/1.

# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1

211000022279114 1152 . A C 22.76 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=31.62;QD=11.38;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:50,6,0
211000022279114 1617 . T C 45.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=34.12;QD=22.87;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:73,6,0
211000022280187 11226 . C T 112.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.922;ClippingRankSum=0.000;DP=9;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=50.94;MQRankSum=0.000;QD=14.10;ReadPosRankSum=-0.421;SOR=1.981 GT:AD:DP:GQ:PL 0/1:4,4:8:99:141,0,15

What am I not understanding here ?)

• Thanks,
• GreensboroMember Posts: 79
edited March 2016

Another question:
When calling raw variants using HaplotypeCaller-GATK 3.5 version I got this warning message:
_WARN 21:15:29,572 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 2L:7942137 has 8 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument. This warning message is output just once per run and further warnings will be suppressed unless the DEBUG logging level is used. _

I think this has to do with the site being repetitive?? since I called variant from one diploid sample any sites shouldn't have more than 2 types of alleles?? And I would like to eliminate any variants called at this sites - I was planning to use the --filterExpression AN > 2 for the purpose.
But, before filtering I checked the raw variants at this position (2L:7942137) and find that there are only two alleles reported - In the warning message it clearly says it called 8 alternate alleles and used 6 top alternate alleles.

2L 7942137 . A ATTT 276.73 . AC=2;AF=1.00;AN=2;DP=9;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=24.48;SOR=1.179 GT:AD:DP:GQ:PL 1/1:0,7:7:24:285,24,0

Thanks in advance !

The AC and AN are a little confusing, I agree. The AC just gives you the number of times the alternate allele shows up in the genotype. For example, let's say I have this site:
20 10008950 . AACACACACACAC A,AACACACACAC 4319.86 . AC=4,2;AF=0.667,0.333;AN=6;BaseQRankSum=-1.317;DP=83;FS=3.768;MLEAC=4,2;MLEAF=0.667,0.333;MQ=59.55;MQ0=0;MQRankSum=1.178;QD=18.02;RPA=23,17,22;RU=AC;ReadPosRankSum=1.455;SOR=0.829;STR GT:AD:DP:GQ:PL 1/2:0,6,11:28:99:1202,422,1392,711,0,661 1/2:1,5,12:27:99:1062,423,1376,574,0,540 1/1:0,15,0:28:45:2091,45,0,1837,45,1792

Notice the two alternate alleles. The first alternate allele shows up 4 times in the genotypes. The second alternate allele shows up 2 times. So, the AC is 4,2 to represent that.

The AN basically gives you the number of alleles in the called genotypes. For example, let's say I have this site:
1 30923 . G T 64.17 . AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=32.08;SOR=0.693 GT:AD:DP:GQ:PL ./. 1/1:0,2:2:6:90,6,0

One of the samples is no-call, and the other sample has two alleles called. The AN shows 2. This can be useful for filtering when you want to fail sites that have more than half the samples with no-calls.

I hope this makes sense.

-Sheila

@everestial007
Hi again,

For your second question about the WARN for alternate alleles, that is just letting you know there were more than the default alternate alleles found at that site. You can change the default with -maxAltAlleles.

HaplotypeCaller evaluates evidence for all possible alternate alleles, but in your case the only two alleles output were the ones that had enough evidence to be reported in the GVCF. (The other alleles probably did not have enough evidence to be output).

-Sheila

• GreensboroMember Posts: 79

@Sheila
Thanks much for being so much helpful again. This clarification made a lots of difference to me.

Regarding the sites with warning I will check visually if variants from this sites should be ignored.

• K
• earthMember Posts: 29

(This may not be the right page to ask this, but I didn't find a better one)

Is there any fast way to test whether a .vcf file and/or its .idx file are corrupt?

I've been running many variant calling jobs on a SLURM cluster, and (sadly) it has been difficult to determine which jobs succeeded and which did not. I've been trying to look at the end of the index files with hex dump and what I'm seeing may indicate that some of them are truncated. This would suggest the corresponding job did not finish successfully, and I would throw out the vcf and idx and try again.

But I don't really know what the idx file is supposed to look like so it's hard to make a decision based on its content.

I've tried to find a formal spec for the .vcf.idx file, but as yet have not found one. Does it exist?

Or, does GATK have a VerifyVcf tool?

Bob H

@BobHarris
Hi Bob H,

You can use ValidateVariants to validate your VCF. I'm not aware of any vcf.idx validation tools. However, if your VCF validates, you can simply delete the .idx file and GATK will regenerate one for you

Sheila

• earthMember Posts: 29

Thanks @Sheila . But if a vcf file is truncated, isn't there a chance it will pass the validation? There doesn't seem to be anything in the file spec that indicates a proper ending, nor anything in the header that indicates the number of records to follow. So if a variant calling job failed, how can the validator know?

If a file is truncated (eg because transfer or writing was interrupted externally) it will lack an appropriate EOF (end of file) encoding, so the validator should be able to pick it up.

Geraldine Van der Auwera, PhD

• earthMember Posts: 29

Thanks @Geraldine_VdAuwera . But I don't see any EOF encoding indicated in the vcf file spec, and I don't see an ascii EOF character (hex 04) at the end of any of my vcf files. The spec I'm looking at is here:
http://samtools.github.io/hts-specs/VCFv4.2.pdf (my files indicate VCFv4.2 in the header).

I think Sheila interpreted my question as a concern about .idx files being corrupt. What I'm trying to do is find a way to tell if the job that created the .idx (and .vcf) completed successfully or not, so that I can know whether I need to rerun it.

Unfortunately the cluster I am using is is not reliable for reporting job success/failure.

The "writing was interrupted externally" is the case I am worried about, where the interrupt happened because the job died (e.g. because it needed more memory than was available) or was killed (apparently on SLURM if too much non-resident memory is in use the job managers starts killing jobs).

Thanks for your help,
Bob H

VCF spec itself doesn't specify an EOF marker but I'd be surprised if there wasn't one at all. Have you tried testing this? You can start a job that writes a VCF and kill the process mid-run. Then run the file (which should be invalid) through the validator and see if it complains. Or, have your pipelining system look for exit codes. If the run completed successfully GATK will output exit code 0; if something failed internally it will output exit code 1, and if anything happened to shut it down from outside it should return some other non-0 exit code.

Geraldine Van der Auwera, PhD

• earthMember Posts: 29

@Geraldine_VdAuwera I can try some tests related to EOF. I think programs that write an actual EOF character are pretty rare in *nix realm though. All the text-based bioinformatics formats I've looked at (with one exception) don't have any file terminator specified. Some of the binary formats do (e.g. BAM), which is why I was hoping to find a spec for the .vcf.idx file. It's increasingly apparent that's a dead end, but surely a spec has to exist somewhere(!).

Regarding the exit codes, I can build something like that into my job scripts in the future. At present I have 687 vcf files (and .idx for each), about 10% of my project (for one of three species), and I'd like to avoid rerunning most of these if I can assure myself they are correct.

Thanks,
Bob H

Hmm that's fair, my expectation of EOFs may be a naive one -- I've never actually looked for them.

In case you're open to migrating to a different pipelining system -- we're about to announce a new pipelining solution which you can read more about here: https://software.broadinstitute.org/wdl/. It includes checking for proper job completion. We're in the process of migrating our production pipelines to this system and we will be providing support to anyone else who wants to use it.

Geraldine Van der Auwera, PhD

• Member Posts: 3

Some data in my vcf files have 1/2 at genotype field. What does that mean? Thank you.

@mbk0asis
Hi,

If you look at the reported alleles at the site, you will see more than one alternate allele. The 1/2 genotype means the genotype is heterozygous for the first alternate allele and second alternate allele. Have a look at the VCF spec under "1.4.2 Genotype fields" for more information.

-Sheila

• DonostiaMember Posts: 3

Hi,
I have 160.000 variants and all of them have "QUAL" = 0
How I can correct?

Thanks.

@Marrospi It's not possible to fix that, you need to redo the analysis. How was this vcf generated?

Geraldine Van der Auwera, PhD

• DonostiaMember Posts: 3

for every .sam for each sample:
1-Samtools import to get the .bam file
3-PICARD SortSam
after, for each sample:
1-PICARD MergeSamFiles
2-GATK BaseRecalibrator

finally,
java -jar GATK -T MuTect2 -R fast.fast -I:tumor inputFile1.bam -I:normal inputFile2.bam --dbsnp dbsnp.vcf -o outputFileMuTec.vcf

Oh, that's fine then (I assumed you were calling germline variants). MuTect doesn't emit QUAL scores. The lod annotations are what you should look at.

Geraldine Van der Auwera, PhD

• Dunedin, NZMember Posts: 1

Hi @Geraldine_VdAuwera, what is the difference between 0|1 and 1|0 in GT field?

@lcscs
Hi,

I think you will find this article helpful. Especially, have a look under "More detailed aspects of semantics of phasing in the VCF format"

-Sheila

• DonostiaMember Posts: 3

Thanks for your answer Geraldine, It has helped me a lot.

How can I get a value similar to a quality? My VCF file has the next information:

chr1 240188328 rs4659948 T C . germline_risk DB;ECNT=1;HCNT=2;MAX_ED=.;MIN_ED=.;NLOD=4.51;TLOD=17.13 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:15,7:0.318:4:3:0.571:477,200:5:10 0/0:15,0:0.00:0:0:.:485,0:11:4
chr1 240188711 rs6429186 A G . clustered_events DB;ECNT=2;HCNT=2;MAX_ED=59;MIN_ED=59;NLOD=8.13;TLOD=38.88 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:16,13:0.448:7:6:0.462:491,413:8:8 0/0:27,0:0.00:0:0:.:858,0:11:16
chr1 240188770 rs7530912 A G . clustered_events;t_lod_fstar DB;ECNT=2;HCNT=2;MAX_ED=59;MIN_ED=59;NLOD=5.72;TLOD=6.28 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:13,3:0.188:0:3:1.00:412,82:7:6 0/0:19,0:0.00:0:0:.:597,0:9:10

@Marrospi This article shows the typical content of VCFs produced for germline calls; I just added a note to clarify that at the top of the page. Your VCF was produced by MuTect for somatic calls, which are a bit different in some respects. One major difference is that they don't have a QUAL score. Instead, you can look at the NLOD and TLOD values. See the MuTect documentation for more information.

Geraldine Van der Auwera, PhD

• ukMember Posts: 5

been hinted at in a question above, but just to be explicit:
is DP really "the number of filtered reads that support each of the reported alleles", or maybe more "the number of filtered reads across all the reported alleles"?
they seem to be single values.

thanks.

@flytrap
Hi,

There are two different DP fields in the VCF.

Have a look here and here for more information.

-Sheila

• ChinaMember Posts: 10

Just so you know... Two links from Best-Practices page has been messed up.

"What is a VCF and how should I interpret it?" Actually points to "Which tools use pedigree information?" , and vice versa.

#### Issue · Github September 2016 by Sheila

Issue Number
1484
State
closed
Last Updated
Assignee
Array
Closed By
vdauwera

@570932004
Hi,

Thank you for pointing it out
We will fix it soon.

Thanks,
Sheila

• torontoMember Posts: 27
edited March 2

Hi,

If someone could help clarify this issue I'd really appreciate it. Thanks.

I had a multisample VCF that I split into samples with selectvar and then sent to tables VAriantstoTables using the splitmultiallelic option. I noticed that for some samples I will have variants with more than 3 PL values such 6 values.

Firstly, I annotated with the VEP which is where the ALLELE comes from. Now of course I noticed that for the first sample below the individual is two different alternate alleles ( A and T ) so I just assumed that the PLs were for those two variants.But the values don't have the right format. You can see one set of triplets has a zero value ( the most likely genotype ) but the other does not. And the VEP calls all ALTs A. Furthermore the second individual has one ALT and 6 values.

Why is the VEP calling things A when we have ALT is * or ALT is T? ( o.k you aren't expected to know that one but even a guess would be helpful). It matters because when you count the ALT alleles at this site for the first individual, for example, do you count one ALT or two ALT? There a number of sites likes this I need to know what's going on here to proceed.

SAMPLE_ID SEX CHROM POS ID GQ PL REF ALT AC HET HOM ALLELE
1256-20693 XX chr1 17215982 rs3738617 99 2196,527,381,1668,0,1620 C A 1 1 0 A
1256-20693 XX chr1 17215982 rs3738617 99 2196,527,381,1668,0,1620 C T 1 1 0 A
1694-22917 XY chr1 17222284 . 99 127,0,1698,253,1718,1971 G * 1 1 0 A
1694-22917 XY chr1 17222284 . 99 127,0,1698,253,1718,1971 G * 1 1 0 A
1694-22917 XY chr1 17222284 . 99 127,0,1698,253,1718,1971 G * 1 1 0 A
1694-22917 XY chr1 17222284 . 99 127,0,1698,253,1718,1971 G * 1 1 0 A
1694-22917 XY chr1 17222284 . 99 127,0,1698,253,1718,1971 G * 1 1 0 A

• torontoMember Posts: 27

I should add that prior to splitting my multi sample VCF into sample level VCFs I used SelectVar to remove all nonvariant and low quality sites. That step was repeated again with the sample level VCFs. I've also taken a closer look and seen some strange PLs like all zeros 0,0,0 or two zeros 0,0,258.

@robertb
Hi,

Can you please post the corresponding VCF records for those sites?

Thanks,
Sheila

• torontoMember Posts: 27

Problem solved, maybe. Here's line from the format genotype field for a multiallelic site in the VCF:
0/2:6,0,2:8:PASS:38:53,64,108,0,44,38:53,64,108,0,44,38

Here's a similar site from the table:

CHROM POS ID REF ALT HET HOM AC
chr1 26211207 rs159529 A AAG 73 39 131

sample pp ref alt genotypes
1034-19683 337,8,0,343,33,368 A AAG AAG/AAG
1291-21555 154,0,49,160,65,224 A AAG A/AAG
1442-22354 753,66,0,753,66,754 A AAG AAG/AAG
1454-21674 734,58,0,736,66,744 A AAG AAG/AAG
1510-21940 452,39,0,452,39,452 A AAG AAG/AAG

chr1 26211207 rs159529 A G 73 39 20

1986-24087 53,63,133,0,70,64 A G A/G
1987-24091 208,0,264,238,290,529 A G A/AAG
1988-24094 122,0,242,146,257,403 A G A/AAG
1989-24099 107,0,200,128,212,340 A G A/AAG
19991-31601 127,0,268,154,284,438 A G A/AAG
20010-31644 151,177,381,0,205,187 A G A/G

I'm filtering genotypes and if a sample does not pass filter ( gq > 20 and dp > 10 ) then I need to adjust the AC and HET/HOM counts accordingly. Some variants even have three alternates and the PP field has 10 values (below). I'm not sure how interpret these. I mean the first three pp values correspond to 0/0, 0/1, 1/1 genotypes but what the next three or seven or more correspond to I'm not sure at this point. Apparently the fourth value is het-alt for the second alternate and the fifth value is hom-alt for the second alternate but what about the sixth? Not sure? And the tenth? I've been searching the web for answers but finding nothing. I someone could clarify this please I'd appreciate it. thanks.

CHROM POS ID REF ALT HET HOM AC
chr1 161035658 rs376275626 CTTAT CTTATTTAT 16 7 15

                                                                                   REF            ALT                  GT


1034-19683 121,127,226,0,100,90,127,226,100,226 CTTAT CTTATTTAT CTTAT/C
1291-21555 70,85,333,0,249,242,85,333,249,333 CTTAT CTTATTTAT CTTAT/C
15542-26229 69,0,44,73,50,123,73,50,123,123 CTTAT CTTATTTAT CTTAT/CTTATTTAT
1588-22326 210,126,121,83,0,74,210,126,83,210 CTTAT CTTATTTAT CTTATTTAT/C
1876-23644 79,85,185,0,100,93,85,185,100,185 CTTAT CTTATTTAT CTTAT/C
19303-30889 134,9,0,135,9,135,135,9,135,135 CTTAT CTTATTTAT CTTATTTAT/CTTATTTAT
2007-24164 158,161,198,0,37,24,161,198,37,198 CTTAT CTTATTTAT CTTAT/C
21684-34154 77,0,94,84,100,184,84,100,184,184 CTTAT CTTATTTAT CTTAT/CTTATTTAT
21762-34808 179,12,0,180,12,180,180,12,180,180 CTTAT CTTATTTAT CTTATTTAT/CTTATTTAT
21928-37301 78,0,94,84,100,184,84,100,184,184 CTTAT CTTATTTAT CTTAT/CTTATTTAT
466-14420 82,85,127,0,42,35,85,127,42,127 CTTAT CTTATTTAT CTTAT/C
779-18300 134,8,0,134,9,135,134,9,135,135 CTTAT CTTATTTAT CTTATTTAT/CTTATTTAT

chr1 161035658 rs376275626 CTTAT C 16 7 13
1034-19683 121,127,226,0,100,90,127,226,100,226 CTTAT C CTTAT/C
1291-21555 70,85,333,0,249,242,85,333,249,333 CTTAT C CTTAT/C
15542-26229 69,0,44,73,50,123,73,50,123,123 CTTAT C CTTAT/CTTATTTAT
1588-22326 210,126,121,83,0,74,210,126,83,210 CTTAT C CTTATTTAT/C
1876-23644 79,85,185,0,100,93,85,185,100,185 CTTAT C CTTAT/C
19303-30889 134,9,0,135,9,135,135,9,135,135 CTTAT C CTTATTTAT/CTTATTTAT
2007-24164 158,161,198,0,37,24,161,198,37,198 CTTAT C CTTAT/C
21684-34154 77,0,94,84,100,184,84,100,184,184 CTTAT C CTTAT/CTTATTTAT
21762-34808 179,12,0,180,12,180,180,12,180,180 CTTAT C CTTATTTAT/CTTATTTAT
21928-37301 78,0,94,84,100,184,84,100,184,184 CTTAT C CTTAT/CTTATTTAT
466-14420 82,85,127,0,42,35,85,127,42,127 CTTAT C CTTAT/C
779-18300 134,8,0,134,9,135,134,9,135,135 CTTAT C CTTATTTAT/CTTATTTAT

chr1 161035658 rs376275626 CTTAT CTTATTTATTTAT 16 7 3
1034-19683 121,127,226,0,100,90,127,226,100,226 CTTAT CTTATTTATTTAT CTTAT/C
1291-21555 70,85,333,0,249,242,85,333,249,333 CTTAT CTTATTTATTTAT CTTAT/C
15542-26229 69,0,44,73,50,123,73,50,123,123 CTTAT CTTATTTATTTAT CTTAT/CTTATTTAT
1588-22326 210,126,121,83,0,74,210,126,83,210 CTTAT CTTATTTATTTAT CTTATTTAT/C
1876-23644 79,85,185,0,100,93,85,185,100,185 CTTAT CTTATTTATTTAT CTTAT/C
19303-30889 134,9,0,135,9,135,135,9,135,135 CTTAT CTTATTTATTTAT CTTATTTAT/CTTATTTAT
2007-24164 158,161,198,0,37,24,161,198,37,198 CTTAT CTTATTTATTTAT CTTAT/C
21684-34154 77,0,94,84,100,184,84,100,184,184 CTTAT CTTATTTATTTAT CTTAT/CTTATTTAT
21762-34808 179,12,0,180,12,180,180,12,180,180 CTTAT CTTATTTATTTAT CTTATTTAT/CTTATTTAT
21928-37301 78,0,94,84,100,184,84,100,184,184 CTTAT CTTATTTATTTAT CTTAT/CTTATTTAT
466-14420 82,85,127,0,42,35,85,127,42,127 CTTAT CTTATTTATTTAT CTTAT/C
779-18300 134,8,0,134,9,135,134,9,135,135 CTTAT CTTATTTATTTAT CTTATTTAT/CTTATTTAT

• torontoMember Posts: 27

here's what's going on and in case anyone reads this the answer is that for a split multiallelic site the PP or PL values will correspond to: 0/0,0/1,1/1,0/2,1/2,2/2,0/3,1/3,2/3,3/3 for a site with three alternates and hence to ten values I was asking about.

GL : genotype likelihoods comprised of comma separated floating point log10-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)`