The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# GATK's quality scores on VCF files

Member Posts: 4
edited March 2013

Hi,

quality score (phred score) is defined as below. (i.e. 1% error rate is equal to phred score of 20 (-10xlog 0.01))

QUAL phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong). If ALT is ”.” (no variant) then this is -10log_10 p(variant), and if ALT is not ”.” this is -10log_10p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. If unknown, the missing value should be specified. (Numeric)

Using GATK to generate vcf files and looking through the quality column of those files, I found out that the max quality score is 441,453 which is extremely huge number.

I wonder if the quality score of GATK tool follows the phred score system; if not, how do you calculate the quality score and what do the numbers of quality score represent?

Look forward to hearing back from you soon and thank you very much.

Tagged:

Yes, the QUAL emitted by GATK is phred-scaled and corresponds to the definition given in the VCF specification.

Geraldine Van der Auwera, PhD

• Member Posts: 4
edited March 2013
java -jar $GATK -T UnifiedGenotyper -R ucsc.hg19.fasta { -I$BAM }x17 -o $VCF --dbsnp$DBSNP -stand_call_conf 50 -stand_emit_conf 10 --min_base_quality_score 10 --max_deletion_fraction 0.05 -dt NONE -glm BOTH -baq CALCULATE_AS_NECESSARY -A QualByDepth -A FisherStrand -A HaplotypeScore -A ReadPosRankSumTest -A MappingQualityRankSumTest -A InbreedingCoeff -A RMSMappingQuality -A DepthOfCoverage


Above is the command line I used for generating vcf files

I'm sorry, I'm not sure I understand your question -- can you please tell me what is the problem you would like help with?

Geraldine Van der Auwera, PhD

• Member Posts: 4

Hi, Thank you for your reply, so my question is if the quality scores of vcf files are estimated using phred score system.
When I compare with the quality scores I got from other tools, the max quality score of 441,453 is way too big so wonder if GATK follows the phred score system.

• Member Posts: 4
edited March 2013

I understand the scoring system and the below line is one excerpted from my vcf file. It would be great if you let me know the reason why this has extremely high qual score compared to that of other tools.

chr17     45214544        rs62075619      T       C       557015 VQSRTrancheBOTH99.90to100.00     .       GT:AD:DP:GQ:PL 1/1:118,2105:2116:99:32767,5777,0
`
Post edited by Geraldine_VdAuwera on

Different tools may emit values on very different scales depending on how the probability calculations are handled internally, e.g. at what stage a number is rounded off. I cannot comment on values emitted by other tools. This simply looks like a case where the GATK has determined that it is extremely likely that there is indeed a variant at this site (which is supported by the allele depths). This is not a cause for concern.

Geraldine Van der Auwera, PhD

• Member Posts: 2

I have done the alignment with bwa and variant calling with GATK. I have all the variant data. I copy,paste a part of my data. please let me know, is it good for analysis ??! Many of variations have "low quality" in column with title "FILTER" ??!! Thank you

# CHROM POS REF ALT QUAL FILTER INFO FORMAT sm

chr3 60596 C A 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
chr3 60648 A G 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
chr3 96098 G A 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
chr3 104222 T C 14.68 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=14.68 GT:AD:DP:GQ:PL 1/1:0,1:1:3:41,3,0
chr3 121954 C T 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
chr3 124835 G A 10.9 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=10.90 GT:AD:DP:GQ:PL 1/1:0,1:1:3:37,3,0
chr3 168610 G A 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0
chr3 174816 A G 15.65 LowQual AC=2;AF=1.00;AN=2;DB;DP=1;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=15.65 GT:AD:DP:GQ:PL 1/1:0,1:1:3:42,3,0

Hi @ofogh1974,

It looks like you have very low depth at those sites (DP = 1). That's probably why you are getting low qualities for your variants. There is not enough data for the program to call the sites confidently.

Geraldine Van der Auwera, PhD

• Member Posts: 2

Best regards
Afagh

Is the presence/absence of a call in the dbSNP used in calculation the QUAL field?

No, dbSNP is only used to annotate the rsID field where applicable. It is not used in any way in the actual variant calling algorithms.

Geraldine Van der Auwera, PhD

• Member Posts: 266 ✭✭

how is the QUAL computed in GATK?

@blueskypy
Hi,

There is some fancy math involved and I am in the process of writing a document to explain it, but for now I hope you can accept this answer:

The QUAL score is the Phred-scaled posterior of AC = 0. We use the AC priors and the PLs to get the likelihood of the data given each AC, then use those to get the posterior probability for each AC. From there, the calculation is 1 - Pr{AC > 0}.

-Sheila

• Member Posts: 266 ✭✭

@Sheila Thanks!

• Member Posts: 266 ✭✭
edited June 2015

@Sheila
does the term "Variant Quality Score" refer to the QUAL field? if so, is the QUAL changed by VQSR? (Seems to me the VQSR is to compute the VQSLOD which is used to set the FILTER field). Thanks!

• Member Posts: 266 ✭✭

does the term "Quality Score" refer to the QUAL field?

@blueskypy No, Variant Quality Score in the context of VQSR refers to the VQSLOD, which is distinct from QUAL

Geraldine Van der Auwera, PhD

• Member Posts: 266 ✭✭

@Geraldine_VdAuwera
Don't mean to be picky! While the BQSR is a perfect term, the VQSR is kind of misleading. Is it actually the creation (instead of recalibration) of VQSLOD, a new ID to better represent VQ (instead of VQS)?

That is a very good way to put it, yes.

Geraldine Van der Auwera, PhD

• HKMember Posts: 1

@Sheila said:
@blueskypy
Hi,

There is some fancy math involved and I am in the process of writing a document to explain it, but for now I hope you can accept this answer:

The QUAL score is the Phred-scaled posterior of AC = 0. We use the AC priors and the PLs to get the likelihood of the data given each AC, then use those to get the posterior probability for each AC. From there, the calculation is 1 - Pr{AC > 0}.

-Sheila

@Sheila:
Hi Sheila,

I'm trying to figure out indeed how does the software calculate the QUAL, and I'm a little bit confused why do we need the AC prior. Is it based on the binomial distribution with the given AF, or a certain mutation rate setting as default value (say, 10E-3) for GATK?. It might be naive but I couldn't find any official documentation.

Or, can we get the probability of homozygous reference allele (0/0) calling for each individual based on the PL, then for the situation AC=0, the probability would be the multiplication of probability when all individual calling 0/0? Since we are aiming to estimate the potability that whether certain site have any positive signal.

Best,
Fred

Issue Number
714
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera