What is the difference between QUAL and GQ annotations?

SheilaSheila Broad InstituteMember, Broadie, Moderator
edited November 2014 in Frequently Asked Questions

There has been a lot of confusion about the difference between QUAL and GQ, and we hope this FAQ will clarify the difference.

The basic difference is that QUAL refers to the variant site whereas GQ refers to a specific sample's GT.

  • QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.

  • GQ tells you how confident we are that the genotype we assigned to a particular sample is correct. It is the difference between the second lowest PL and the lowest PL. Because PLs are normalized so that the lowest one is always 0, GQ is simply the second lowest PL.
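As a minimal sketch of the GQ definition above, the snippet below derives GQ from a sample's list of PL values. The function name `genotype_quality`, the `cap` parameter and the example PL values are illustrative assumptions, not part of any GATK API; the default cap of 99 reflects the ceiling GATK applies to reported GQ values.

```python
def genotype_quality(pls, cap=99):
    """Return GQ as the gap between the two smallest PL values.

    pls: normalized, Phred-scaled genotype likelihoods for one sample.
    cap: assumed ceiling on the reported GQ; pass None to disable it.
    """
    if len(pls) < 2:
        raise ValueError("need at least two PL values")
    lowest, second_lowest = sorted(pls)[:2]
    gq = second_lowest - lowest  # lowest is 0 when PLs are normalized
    return gq if cap is None else min(gq, cap)


# Illustrative PLs in the order hom-ref, het, hom-var.
print(genotype_quality([45, 0, 310]))    # 45
print(genotype_quality([900, 0, 1200]))  # 99 (900, capped)
```

With the cap disabled, the second call returns the raw second lowest PL, which shows how much separation the cap can hide.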

QUAL (or more importantly, its normalized form, QD) is mostly useful in a multisample context. When you are recalibrating a cohort callset, you're going to be looking exclusively at site-level annotations like QD, because at that point what you're looking for is evidence of variation overall. That way you don't rely too much on individual sample calls, which are less robust.

In fact, many cohort studies don't even really care about individual genotype assignments, so they only use site annotations for their entire analysis.

Conversely, QUAL may seem redundant if you have only one sample. If that sample has a good GQ (and, more importantly, well-separated PLs), then admittedly you don't really need to look at the QUAL: you know what you have. If the GQ is not good, you can typically rely on the PLs to tell you whether you probably do have a variant and we are just not sure whether it's het or hom-var. If hom-ref is also a possibility, the call may be a false positive.
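One way to picture the single-sample reasoning above is to list which genotypes the PLs fail to rule out. This is only a sketch for a biallelic site: the `max_pl` cutoff of 20 and the function names are hypothetical choices for illustration, not GATK defaults.

```python
GENOTYPES = ("hom-ref", "het", "hom-var")  # PL order at a biallelic site


def plausible_genotypes(pls, max_pl=20):
    """Return the genotypes whose PL lies within max_pl of the best PL."""
    if len(pls) != len(GENOTYPES):
        raise ValueError("expected three PLs (biallelic site)")
    best = min(pls)
    return [g for g, pl in zip(GENOTYPES, pls) if pl - best <= max_pl]


def describe(pls, max_pl=20):
    """Summarize what the PLs say about a single-sample call."""
    candidates = plausible_genotypes(pls, max_pl)
    if len(candidates) == 1:
        return f"confident {candidates[0]}"
    if "hom-ref" in candidates:
        return "possible false positive (hom-ref not excluded)"
    return "variant likely, het vs hom-var unresolved"


print(describe([400, 0, 350]))  # well-separated PLs
print(describe([300, 0, 10]))   # het and hom-var both plausible
print(describe([8, 0, 250]))    # hom-ref still plausible
```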

That said, it is more effective to filter on site-level annotations first, then refine and filter genotypes as appropriate. That's the workflow we recommend, based on years of experience doing this at fairly large scales...
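The order of operations described above (site-level filter first, genotype refinement second) could be sketched as follows. The record layout, the function name and both thresholds are assumptions made for this example; the thresholds are placeholders, not recommended cutoffs.

```python
def filter_site_then_genotypes(record, min_qd=2.0, min_gq=20):
    """Apply a site-level filter, then mask low-confidence genotypes.

    record: dict with a site-level 'QD' value and a 'samples' mapping of
            sample name -> {'GT': str, 'GQ': int} (hypothetical layout).
    Returns None if the site fails, else a mapping of sample -> genotype.
    """
    # Step 1: site-level annotation decides whether the site is kept at all.
    if record["QD"] < min_qd:
        return None

    # Step 2: only for passing sites, refine the individual genotypes.
    refined = {}
    for name, call in record["samples"].items():
        refined[name] = call["GT"] if call["GQ"] >= min_gq else "./."
    return refined


site = {
    "QD": 14.2,
    "samples": {
        "sampleA": {"GT": "0/1", "GQ": 99},
        "sampleB": {"GT": "1/1", "GQ": 7},
    },
}
print(filter_site_then_genotypes(site))
```

Here the site passes on QD, sampleA keeps its genotype, and sampleB is masked as a no-call because its GQ falls below the placeholder cutoff.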

Post edited by Geraldine_VdAuwera on

Comments

  • sdsmithsdsmith MadisonMember

    Are you able to help me understand a little better what QD is telling me? I understand it is the ratio of the QUAL to the AD, but what is that number saying? How can I use it to decide what threshold to set for PASS/FAIL in my filter?

    Thanks,
    SS

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sdsmith
    Hi SS,

    We have some basic recommendations for hard filtering here: https://www.broadinstitute.org/gatk/guide/article?id=2806 However, it will be up to you to analyze your data and determine what cutoffs to use.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @sdsmith After some discussion we realized that it can be difficult to understand the meaningfulness of the annotation threshold values used for filtering, so @Sheila is going to start a project to document this in a lot more detail. This will happen over the next few weeks.

  • nkobmoonkobmoo ParisMember

    Hi,

    I'm really interested in detailed documentation on SNP annotation thresholds for filtering. If such a document exists, could you please point us to it?

    Thank you very much in advance.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @nkobmoo
    Hi,

    There is this document that should help. I am working on adding some more explanations, but it should be a good place to start.

    -Sheila

  • SWATISWATI UKMember

    Hello Sheila,
    As I understand, QUAL is a representation of accuracy of genotyping. But what does a '.' represent under the QUAL column in a VCF file? I do not have any numeric value for Phred-scaled score for assertion of ALT allele in the entire column.

    What does this mean for filtering low quality SNPs or genotypes?

    Thanks
    Swati

  • SWATISWATI UKMember

    Dear Sheila,

    This is an example of my filtered recode VCF file:

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1 10

    ET_C6390828 41 S1_612208905 G T . PASS .;DP=119 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0
    ET_C6410100 69 S1_614033230 A G . PASS .;DP=2833 GT:AD:DP:GQ:PL 0/1:19,19:38:100:255,0,255 0/1:5,10:15:99:255,0,135
    ET_C6742648 84 S1_647090026 C T . PASS .;DP=8447 GT:AD:DP:GQ:PL 0/1:48,9:57:99:152,0,255 0/1:28,4:32:99:48,0,255

    From your previous discussion (http://gatkforums.broadinstitute.org/gatk/discussion/4688/qual-is-a-dot-and-filter-is-pass-in-vcf), I understood that
    1. the sites with ./. genotypes are no-call sites, [...]. A no-call site means there was not enough information to make a genotype call. You can tell a no-call site because there is no QUAL and no genotype (GT).
    2. the term 'PASS' was added during a subsequent filtering step (file named as filtered.recode.vcf) by the genomics facility provider. They have used MAF (>0.01) & missing data per site (<90%) as filtering options. This is confirmed as I do not see any ##FILTER information mentioned in the VCF file header.

    But I'm not sure what it means when I have a genotype and a 'dot' for QUAL.

    Thank you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @SWATI
    Hi Swati,

    Did you produce the VCF using GATK tools? If so, can you tell us the exact command you ran and what version of GATK you are using?

    Thanks,
    Sheila

  • SWATISWATI UKMember

    Dear Sheila,
    Thank you for your reply. I got the VCF file from my GBS service provider who used their Tassel pipeline. The summary report says, "VCF is format for holding SNP information that retains information on depth of coverage for each allele, and can be output from the GBS pipeline by replacing the plugins ‘TagsToSNPByAlignmentPlugin’ and ‘MergeDuplicateSNPsPlugin’ with ‘tbt2vcfPlugin’ and ‘MergeDuplicateSNP_vcf_Plugin’. Genotype likelihood scores are calculated based on formula 3.8 of Etter et al 2013=1, and the most likely genotype is assigned. Genotype quality (GQ) score is calculated to the GATK version documented here: http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk ."

    I think these could be the commands used to generate the VCF:
    Memory Settings: -Xms512m -Xmx64G
    Tassel Pipeline Arguments: -fork1 -MergeDuplicateSNP_vcf_Plugin -i /workdir/qisun/working/qs105/VCF/MERGETBT.c1 -o /workdir/qisun/working/qs105/VCF/1.vcf -ak 3 -endPlugin -runfork1
    [main] INFO net.maizegenetics.pipeline.TasselPipeline - Tassel Version: 3.0.165 Date: January 16, 2014

    The VCF header reads:

    ##fileformat=VCFv4.0
    ##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
    ##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
    ##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">

    P.S. I am a biologist and still trying to learn bioinformatics. I am afraid I may not be familiar with very technical terms and the command line; hence the long post.

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    @SWATI, if these files were produced by a caller that is not part of GATK we can't help you. You should ask the provider for help. Good luck.
  • manasakg16manasakg16 bangaloreMember

    Hi
    Can you please provide a link that explains the Genotype Quality (GQ) score and the related commands?
    Thanks in advance

  • SheilaSheila Broad InstituteMember, Broadie, Moderator