The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.

GATK's quality scores on VCF files

yc459yc459 Posts: 4Member
edited March 2013 in Ask the GATK team

Hi,

According to the link http://www.1000genomes.org/wiki/Analysis/Variant Call Format/vcf-variant-call-format-version-41.

quality score (phred score) is defined as below. (i.e. 1% error rate is equal to phred score of 20 (-10xlog 0.01))

QUAL phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong). If ALT is ”.” (no variant) then this is -10log_10 p(variant), and if ALT is not ”.” this is -10log_10p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. If unknown, the missing value should be specified. (Numeric)

Using GATK to generate vcf files and looking through the quality column of those files, I found out that the max quality score is 441,453 which is extremely huge number.

I wonder if the quality score of GATK tool follows the phred score system; if not, how do you calculate the quality score and what do the numbers of quality score represent?

Look forward to hearing back from you soon and thank you very much.

Tagged:

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    Yes, the QUAL emitted by GATK is phred-scaled and corresponds to the definition given in the VCF specification.

    Geraldine Van der Auwera, PhD

  • yc459yc459 Posts: 4Member
    edited March 2013
    java -jar $GATK -T UnifiedGenotyper -R ucsc.hg19.fasta { -I $BAM }x17 -o $VCF --dbsnp $DBSNP -stand_call_conf 50 -stand_emit_conf 10 --min_base_quality_score 10 --max_deletion_fraction 0.05 -dt NONE -glm BOTH -baq CALCULATE_AS_NECESSARY -A QualByDepth -A FisherStrand -A HaplotypeScore -A ReadPosRankSumTest -A MappingQualityRankSumTest -A InbreedingCoeff -A RMSMappingQuality -A DepthOfCoverage
    

    Above is the command line I used for generating vcf files

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    I'm sorry, I'm not sure I understand your question -- can you please tell me what is the problem you would like help with?

    Geraldine Van der Auwera, PhD

  • yc459yc459 Posts: 4Member

    Hi, Thank you for your reply, so my question is if the quality scores of vcf files are estimated using phred score system.
    When I compare with the quality scores I got from other tools, the max quality score of 441,453 is way too big so wonder if GATK follows the phred score system.

  • yc459yc459 Posts: 4Member
    edited March 2013

    I understand the scoring system and the below line is one excerpted from my vcf file. It would be great if you let me know the reason why this has extremely high qual score compared to that of other tools.

    chr17     45214544        rs62075619      T       C       557015 VQSRTrancheBOTH99.90to100.00     .       GT:AD:DP:GQ:PL 1/1:118,2105:2116:99:32767,5777,0
    
    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    Different tools may emit values on very different scales depending on how the probability calculations are handled internally, e.g. at what stage a number is rounded off. I cannot comment on values emitted by other tools. This simply looks like a case where the GATK has determined that it is extremely likely that there is indeed a variant at this site (which is supported by the allele depths). This is not a cause for concern.

    Geraldine Van der Auwera, PhD

  • ofogh1974ofogh1974 Posts: 2Member

    I have done the alignment with bwa and variant calling with GATK. I have all the variant data. I copy,paste a part of my data. please let me know, is it good for analysis ??! Many of variations have "low quality" in column with title "FILTER" ??!! Thank you

    CHROM POS REF ALT QUAL FILTER INFO FORMAT sm

    chr3 60596 C A 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
    chr3 60648 A G 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
    chr3 96098 G A 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
    chr3 104222 T C 14.68 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=14.68 GT:AD:DP:GQ:PL 1/1:0,1:1:3:41,3,0
    chr3 121954 C T 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
    chr3 124835 G A 10.9 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=10.90 GT:AD:DP:GQ:PL 1/1:0,1:1:3:37,3,0
    chr3 168610 G A 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
    chr3 174816 A G 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    Hi @ofogh1974,

    It looks like you have very low depth at those sites (DP = 1). That's probably why you are getting low qualities for your variants. There is not enough data for the program to call the sites confidently.

    Geraldine Van der Auwera, PhD

  • ofogh1974ofogh1974 Posts: 2Member

    Hi Geraldine. Many thanks for your promptly answer.
    Best regards
    Afagh

  • kjclowerskjclowers UW MadisonPosts: 14Member

    Is the presence/absence of a call in the dbSNP used in calculation the QUAL field?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    No, dbSNP is only used to annotate the rsID field where applicable. It is not used in any way in the actual variant calling algorithms.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 261Member ✭✭

    how is the QUAL computed in GATK?

  • SheilaSheila Broad InstitutePosts: 3,969Member, Broadie, Moderator, Dev admin

    @blueskypy
    Hi,

    There is some fancy math involved and I am in the process of writing a document to explain it, but for now I hope you can accept this answer:

    The QUAL score is the Phred-scaled posterior of AC = 0. We use the AC priors and the PLs to get the likelihood of the data given each AC, then use those to get the posterior probability for each AC. From there, the calculation is 1 - Pr{AC > 0}.

    -Sheila

  • blueskypyblueskypy Posts: 261Member ✭✭

    @Sheila Thanks!

  • blueskypyblueskypy Posts: 261Member ✭✭
    edited June 2015

    @Sheila
    does the term "Variant Quality Score" refer to the QUAL field? if so, is the QUAL changed by VQSR? (Seems to me the VQSR is to compute the VQSLOD which is used to set the FILTER field). Thanks!

  • blueskypyblueskypy Posts: 261Member ✭✭

    does the term "Quality Score" refer to the QUAL field?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    @blueskypy No, Variant Quality Score in the context of VQSR refers to the VQSLOD, which is distinct from QUAL

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 261Member ✭✭

    @Geraldine_VdAuwera
    Don't mean to be picky! While the BQSR is a perfect term, the VQSR is kind of misleading. Is it actually the creation (instead of recalibration) of VQSLOD, a new ID to better represent VQ (instead of VQS)? :)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    That is a very good way to put it, yes. :)

    Geraldine Van der Auwera, PhD

  • FredZhouFredZhou HKPosts: 1Member

    @Sheila said:
    @blueskypy
    Hi,

    There is some fancy math involved and I am in the process of writing a document to explain it, but for now I hope you can accept this answer:

    The QUAL score is the Phred-scaled posterior of AC = 0. We use the AC priors and the PLs to get the likelihood of the data given each AC, then use those to get the posterior probability for each AC. From there, the calculation is 1 - Pr{AC > 0}.

    -Sheila

    @Sheila:
    Hi Sheila,

    I'm trying to figure out indeed how does the software calculate the QUAL, and I'm a little bit confused why do we need the AC prior. Is it based on the binomial distribution with the given AF, or a certain mutation rate setting as default value (say, 10E-3) for GATK?. It might be naive but I couldn't find any official documentation.

    Or, can we get the probability of homozygous reference allele (0/0) calling for each individual based on the PL, then for the situation AC=0, the probability would be the multiplication of probability when all individual calling 0/0? Since we are aiming to estimate the potability that whether certain site have any positive signal.

    Best,
    Fred

    Issue · Github
    by Sheila

    Issue Number
    714
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    Hi Fred, we just posted a document explaining the QUAL score calculation here: https://www.broadinstitute.org/gatk/guide/article?id=7258

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.