Hi,
According to the VCF 4.1 specification (http://www.1000genomes.org/wiki/Analysis/Variant Call Format/vcf-variant-call-format-version-41), the quality score (Phred score) is defined as below, i.e. a 1% error rate corresponds to a Phred score of 20 (-10 x log10(0.01)):
QUAL phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong). If ALT is ”.” (no variant) then this is -10log_10 p(variant), and if ALT is not ”.” this is -10log_10p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. If unknown, the missing value should be specified. (Numeric)
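To make the definition quoted above concrete, here is a minimal Python sketch of the Phred conversion in both directions. The function names are made up for illustration and are not part of GATK or any VCF library.

```python
import math

def phred_from_error_prob(p_error):
    """Phred-scaled score: Q = -10 * log10(p_error)."""
    return -10.0 * math.log10(p_error)

def error_prob_from_phred(q):
    """Inverse conversion: p_error = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

# A 1% error rate corresponds to a Phred score of about 20,
# and every additional 10 points divides the error probability by 10.
q = phred_from_error_prob(0.01)
p = error_prob_from_phred(30.0)
```

Because the scale is logarithmic, a QUAL in the hundreds already corresponds to an error probability far smaller than anyone can meaningfully distinguish.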
Using GATK to generate VCF files and looking through the QUAL column of those files, I found that the maximum quality score is 441,453, which is an extremely large number.
I wonder if the quality score of GATK tool follows the phred score system; if not, how do you calculate the quality score and what do the numbers of quality score represent?
Look forward to hearing back from you soon and thank you very much.
Answers
Yes, the QUAL emitted by GATK is phred-scaled and corresponds to the definition given in the VCF specification.
Above is the command line I used for generating vcf files
I'm sorry, I'm not sure I understand your question -- can you please tell me what is the problem you would like help with?
Hi, thank you for your reply. My question is whether the quality scores in the VCF files are estimated using the Phred score system.
Compared with the quality scores I got from other tools, the max quality score of 441,453 is far too big, so I wonder whether GATK follows the Phred score system.
Yes, as I replied above, the GATK qual is indeed phred-scaled. In some cases the quals can go very high; this is not necessarily a very informative value, and it's not straightforward to compare the quals returned by different programs. If you are concerned about specific calls, please feel free to post a few lines from your VCF for us to have a look at.
I understand the scoring system, and the line below is an excerpt from my VCF file. It would be great if you could let me know why it has such an extremely high QUAL score compared to that of other tools.
Different tools may emit values on very different scales depending on how the probability calculations are handled internally, e.g. at what stage a number is rounded off. I cannot comment on values emitted by other tools. This simply looks like a case where the GATK has determined that it is extremely likely that there is indeed a variant at this site (which is supported by the allele depths). This is not a cause for concern.
I have done the alignment with bwa and the variant calling with GATK, and I have all the variant data. I have pasted part of my data below. Please let me know whether it is good enough for analysis. Many of the variants have "LowQual" in the "FILTER" column. Thank you.
CHROM POS REF ALT QUAL FILTER INFO FORMAT sm
chr3 60596 C A 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
chr3 60648 A G 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
chr3 96098 G A 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
chr3 104222 T C 14.68 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=14.68 GT:AD:DP:GQ:PL 1/1:0,1:1:3:41,3,0
chr3 121954 C T 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
chr3 124835 G A 10.9 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=10.90 GT:AD:DP:GQ:PL 1/1:0,1:1:3:37,3,0
chr3 168610 G A 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
chr3 174816 A G 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
Hi @ofogh1974,
It looks like you have very low depth at those sites (DP = 1). That's probably why you are getting low qualities for your variants. There is not enough data for the program to call the sites confidently.
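To see why a single read gives so little confidence, the sample column of the records above can be unpacked. The sketch below is only an illustration: the helper names are hypothetical, it assumes a flat prior over the three genotypes, and it relies on my understanding that GATK reports GQ as the gap between the two lowest PLs, capped at 99.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with the sample's values, e.g. GT:AD:DP:GQ:PL."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def normalized_likelihoods(pls):
    """Turn Phred-scaled likelihoods into probabilities (flat prior assumed)."""
    likes = [10.0 ** (-pl / 10.0) for pl in pls]
    total = sum(likes)
    return [like / total for like in likes]

def gq_from_pl(pls, cap=99):
    """GQ as the difference between the two smallest PLs, capped (assumption)."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], cap)

# Sample column taken from the first record posted above (chr3:60596).
sample = parse_sample("GT:AD:DP:GQ:PL", "1/1:0,1:1:3:42,3,0")
pls = [int(x) for x in sample["PL"].split(",")]

gq = gq_from_pl(pls)                 # matches the GQ of 3 in the record
probs = normalized_likelihoods(pls)  # order: 0/0, 0/1, 1/1
```

With one supporting read, the 1/1 genotype comes out only about twice as likely as 0/1, which is why both GQ and QUAL stay low at these sites.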
Hi Geraldine. Many thanks for your prompt answer.
Best regards
Afagh
Is the presence or absence of a call in dbSNP used in calculating the QUAL field?
No, dbSNP is only used to annotate the rsID field where applicable. It is not used in any way in the actual variant calling algorithms.
How is the QUAL computed in GATK?
@blueskypy
Hi,
There is some fancy math involved and I am in the process of writing a document to explain it, but for now I hope you can accept this answer:
The QUAL score is the Phred-scaled posterior of AC = 0. We use the AC priors and the PLs to get the likelihood of the data given each AC, then use those to get the posterior probability for each AC. From there, the calculation is 1 - Pr{AC > 0}.
-Sheila
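For readers who want to see the shape of that calculation, here is a rough single-sample, diploid sketch in Python. It is not GATK's implementation: the function names are hypothetical, the prior P(AC = i) = heterozygosity / i with heterozygosity = 0.001 is an assumption, and the rounded PLs from a VCF stand in for the unrounded likelihoods the caller uses internally, so the result will not reproduce GATK's QUAL exactly.

```python
import math

def ac_priors_single_diploid(heterozygosity=0.001):
    """Assumed priors for AC = 0, 1, 2 in one diploid sample."""
    p_ac1 = heterozygosity / 1.0
    p_ac2 = heterozygosity / 2.0
    return [1.0 - p_ac1 - p_ac2, p_ac1, p_ac2]

def qual_from_pl(pls, heterozygosity=0.001):
    """Phred-scaled posterior of AC = 0, i.e. of 1 - Pr{AC > 0}."""
    # For a single diploid sample, AC = 0, 1, 2 maps onto 0/0, 0/1, 1/1.
    likes = [10.0 ** (-pl / 10.0) for pl in pls]
    priors = ac_priors_single_diploid(heterozygosity)
    joint = [prior * like for prior, like in zip(priors, likes)]
    post_ac0 = joint[0] / sum(joint)
    if post_ac0 == 0.0:
        # The posterior underflowed; the site is a variant beyond float precision.
        return math.inf
    return -10.0 * math.log10(post_ac0)

# PLs from one of the DP=1 records posted earlier in this thread.
low_depth_qual = qual_from_pl([42, 3, 0])
# PLs for a hypothetical site that strongly favours 0/0.
ref_site_qual = qual_from_pl([0, 30, 300])
```

Under these assumptions the DP=1 record lands in the same low range as the QUAL reported in the VCF, while a site whose PLs favour 0/0 gets a QUAL close to zero.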
@Sheila Thanks!
@Sheila
does the term "Variant Quality Score" refer to the QUAL field? if so, is the QUAL changed by VQSR? (Seems to me the VQSR is to compute the VQSLOD which is used to set the FILTER field). Thanks!
does the term "Quality Score" refer to the QUAL field?
@blueskypy No, Variant Quality Score in the context of VQSR refers to the VQSLOD, which is distinct from QUAL
@Geraldine_VdAuwera
I don't mean to be picky, but while BQSR is a perfect term, VQSR is somewhat misleading. Isn't it actually the creation (rather than the recalibration) of VQSLOD, a new ID that better represents VQ (instead of VQS)?
That is a very good way to put it, yes.
@Sheila:
Hi Sheila,
I'm trying to figure out how the software actually calculates the QUAL, and I'm a little confused about why we need the AC prior. Is it based on the binomial distribution with the given AF, or on a certain mutation rate set as a default value (say, 10E-3) in GATK? This might be a naive question, but I couldn't find any official documentation.
Alternatively, could we get the probability of a homozygous reference (0/0) call for each individual from the PLs, so that for AC=0 the probability would be the product of the probabilities of every individual being called 0/0? After all, we are trying to estimate the probability that a given site has any positive signal.
Best,
Fred
Hi Fred, we just posted a document explaining the QUAL score calculation here: https://www.broadinstitute.org/gatk/guide/article?id=7258