To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Calculation of PL and GQ by HaplotypeCaller and GenotypeGVCFs

PL is a sample-level annotation calculated by HaplotypeCaller and GenotypeGVCFs, recorded in the sample-level columns of variant records in VCF files. This annotation represents the normalized Phred-scaled likelihoods of the genotypes considered in the variant record for each sample.

This article clarifies how the PL values are calculated and how this relates to the value of the GQ field.

Contents

  1. The basic math
  2. Example and interpretation
  3. Special case: non-reference confidence model (GVCF mode)

1. The basic math

The basic formula for calculating PL is:

$$ PL = -10 * \log{P(Genotype | Data)} $$

where P(Genotype | Data) is the conditional probability of the Genotype given the sequence Data that we have observed. The process by which we determine the value of P(Genotype | Data) is described here.

Once we have that probability, we simply take the log of it and multiply it by -10 to put it into Phred scale. Then we normalize the values across all genotypes so that the PL value of the most likely genotype is 0, which we do simply by subtracting the value of the lowest PL from all the values.

The reason we like to work in Phred scale is because it makes it much easier to work with the very small numbers involved in these calculations. One thing to keep in mind of course is that Phred is a log scale, so whenever we need to do a division or multiplication operation (e.g. multiplying probabilities), in Phred scale this will be done as a subtraction or addition.


2. Example and interpretation

Here’s a worked-out example to illustrate this process. Suppose we have a site where the reference allele is A, we observed one read that has a non-reference allele T at the position of interest, and we have in hand the conditional probabilities calculated by HaplotypeCaller based on that one read (if we had more reads, their contributions would be multiplied -- or in log space, added).

Please note that the values chosen for this example have been simplified and may not be reflective of actual probabilities calculated by Haplotype Caller.

# Alleles
Reference: A
Read: T

# Conditional probabilities calculated by HC 
P(AA | Data) = 0.000001
P(AT | Data) = 0.000100
P(TT | Data) = 0.010000

Calculate the raw PL values

We want to determine the PLs of the genotype being 0/0, 0/1, and 1/1, respectively. So we apply the formula given earlier, which yields the following values:

Genotype A/A A/T T/T
Raw PL -10 * log(0.000001) = 60 -10 * log(0.000100) = 40 -10 * log(0.010000) = 20

Our first observation here is that the genotype for which the conditional probability was the highest turns out to get the lowest PL value. This is expected because, as described here, the PL is the likelihood of the genotype, which means (rather unintuitively if you’re not a stats buff) it is the probability that the genotype is not correct. So, low values mean a genotype is more likely, and high values means it’s less likely.

Normalize

At this point we have one more small transformation to make before we emit the final PL values to the VCF: we are going to normalize the values so that the lowest PL value is zero, and the rest are scaled relative to that. Since we’re in log space, we do this simply by subtracting the lowest value, 20, from the others, yielding the following final PL values:

Genotype A/A A/T T/T
Normalized PL 60 - 20 = 40 40 - 20 = 20 20 - 20 = 0

We see that there is a direct relationship between the scaling of the PLs and the original probabilities: we had chosen probabilities that were each 100 times more or less likely than the next, and in the final PLs we see that the values are spaced out by a factor of 20, which is the Phred-scale equivalent of 100. This gives us a very convenient way to estimate how the numbers relate to each other -- and how reliable the genotype assignment is -- with just a glance at the PL field in the VCF record.

Genotype quality

We actually formalize this assessment of genotype quality in the GQ annotation, as described also here.The value of GQ is simply the difference between the second lowest PL and the lowest PL (which is always 0). So, in our example GQ = 20 - 0 = 20. Note that the value of GQ is capped at 99 for practical reasons, so even if the calculated GQ is higher, the value emitted to the VCF will be 99.


3. Special case: non-reference confidence model (GVCF mode)

When you run HaplotypeCaller with -ERC GVCF to produce a gVCF, there is an additional calculation to determine the genotype likelihoods associated with the symbolic <NON-REF> allele (which represents the possibilities that remain once you’ve eliminated the REF allele and any ALT alleles that are being evaluated explicitly).

The PL values for any possible genotype that includes the <NON-REF> allele have to be calculated a little differently than what is explained above because HaplotypeCaller cannot directly determine the conditional probabilities of genotypes involving <NON-REF>. Instead, it uses base quality scores to model the genotype likelihoods.

See also this article.

Post edited by Sheila on
Sign In or Register to comment.