The current GATK version is 3.7-0

re-scaling genotype likelihoods


I am trying to incorporate genotype likelihoods into a downstream analysis. I have two questions:

1) Why is the most likely genotype scaled to a Phred score of zero?

2) Is there a way to undo the scaling? I have seen downstream tools undo the scaling, but I don't know how they do it. Is there an equation that will return an estimated genotype likelihood from the scaled genotype likelihoods?

Thank you for your time.

Zev Kronenberg


  • ebanks (Broad Institute; Member, Broadie, Dev)

    Hi Zev,

    1) This is just a normalization (not a scaling) and does not affect the actual posterior probabilities at all. This isn't the appropriate forum to go over the mathematical rationale, though, so you'll either need to take my word for it or ask for an explanation somewhere like SEQanswers.
    2) There is no need to undo the normalization and I cannot imagine that any downstream tools are actually doing this (again see #1). The likelihoods in the VCFs are not "scaled" or "estimated" and should be taken as accurate representations of the data.

    Hope that helps!

  • I am going to try and clarify my question:

    I completely trust the genotype calculations, but I am still having trouble incorporating PL into a population genetics measure. My problem is the normalization:

    The normalization sets the most likely genotype to a Phred-scaled likelihood of 0, i.e. P = 1.

    "Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"

    "The most likely genotype (given in the GT field) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype."

    So in the case of a terrible het call, the genotype likelihoods will be something like (2, 0, 1) for AA, AB, BB.

    The problem is assessing the uncertainty of the het call when it has P = 1 (a Phred score of zero).

    When I integrate over the other genotypes, AA & BB, I am concerned I am introducing a bias.

    Maybe I don't need to worry about it. I just noticed that other tools, like BEAGLE, that use GATK VCFs, have a modified PL where the most likely genotype is not required to have a phred score of zero.
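
To make the quoted description concrete, here is a minimal Python sketch of the normalization as the quotes above state it: Phred-scale each genotype likelihood, then subtract the smallest value so the most likely genotype lands on 0. The raw likelihood values and the function name `normalize_pl` are made up for illustration, and this is not GATK's actual code.

```python
import math

def normalize_pl(likelihoods):
    """Phred-scale raw genotype likelihoods and shift them so the best one is 0.

    `likelihoods` holds P(data | genotype) for each genotype, e.g. AA, AB, BB.
    """
    # Phred scale: -10 * log10(likelihood); smaller means more likely.
    raw_phred = [-10.0 * math.log10(p) for p in likelihoods]
    # Subtract the minimum so the most likely genotype gets 0.
    best = min(raw_phred)
    # PL values in a VCF are integers, so round after shifting.
    return [round(q - best) for q in raw_phred]

# Hypothetical raw likelihoods for AA, AB, BB (assumed values, not real data).
raw = [1e-5, 1e-3, 1e-4]
pl = normalize_pl(raw)
print(pl)  # [20, 0, 10]
```

The differences between genotypes (20 vs. 10 vs. 0) are the same before and after the shift; only the common offset is dropped.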


  • I think the easiest way around this is:

    phred / sum(phreds)

    that will somewhat undo the normalization.

  • ebanks (Broad Institute; Member, Broadie, Dev)

    Okay, I think I understand the disconnect now. It is critical to understand that likelihoods are different than probabilities. With likelihoods really only the relative values matter, so 20 vs. 10 is the same as 10 vs. 0; that is why we don't lose any information during the normalization process. With that in mind the GATK does not need to normalize the likelihoods (and that's why e.g. Beagle doesn't require it) - we just do it because it's cleaner (and that's the convention). So there is no bias involved in the normalization process.

    I do have to say though that I'm concerned that you aren't quite understanding what phred-scaled likelihoods are. The "fix" that you propose above is not good. The likelihoods are in log-space and need to be converted to real-space before you can create normalized posterior probabilities from them. I don't mean to put you down (I am sure you are very competent) but please make sure you understand what data you have in hand before trying to manipulate it!

    Good luck!
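
The conversion described in the last reply can be sketched in Python as follows: move each PL from log-space back to real-space with 10^(-PL/10), then normalize so the values sum to 1. The function names are hypothetical, and the flat prior over genotypes is an assumption made only to keep the example short; a real analysis would likely supply its own priors (for instance from allele frequencies). The sketch also shows why `phred / sum(phreds)` does not work, and that adding a constant to every PL changes nothing.

```python
def pl_to_relative_likelihoods(pl):
    """Convert Phred-scaled likelihoods back to real-space relative likelihoods."""
    return [10.0 ** (-q / 10.0) for q in pl]

def pl_to_posteriors(pl, priors=None):
    """Normalized genotype probabilities from PLs.

    Assumption: with no priors given, a flat prior over genotypes is used.
    """
    rel = pl_to_relative_likelihoods(pl)
    if priors is None:
        priors = [1.0 / len(pl)] * len(pl)
    weighted = [l * p for l, p in zip(rel, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

def naive_fix(pl):
    """The phred / sum(phreds) idea from the thread, kept only for comparison."""
    total = sum(pl)
    return [q / total for q in pl]

weak_het = [2, 0, 1]  # the AA, AB, BB example from the thread
post = pl_to_posteriors(weak_het)
print([round(p, 2) for p in post])  # roughly [0.26, 0.41, 0.33]

# The naive version gives the most likely genotype a value of exactly 0.
print(naive_fix(weak_het))

# Only relative values matter: shifting every PL by 10 gives the same result.
shifted = pl_to_posteriors([12, 10, 11])
print(all(abs(a - b) < 1e-12 for a, b in zip(post, shifted)))
```

Under the flat-prior assumption the weak het call comes out at about 0.41, so the uncertainty is still fully visible even though its PL is 0.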
