HaplotypeCaller Scores that are multiple of 3 indicate much better concordance than those that are n

sicottesicotte Member
edited January 2016 in Ask the GATK team

I was comparing the concordance of genotype calls as a function of Genotype Quality (GQ) for the GATK 3.5 HaplotypeCaller
(run in genotype-given-alleles mode and emitting all sites, even those with a score of 0).

My "truth" set is the genotypes from Illumina OMNI 2.5M arrays, and I am comparing the genotype calls from Illumina exome arrays against it.

I notice that the Genotype Quality (GQ) scores of the HaplotypeCaller are concentrated at multiples of 3,
e.g. there are many more scores (by orders of magnitude) at
GQ=0,3,6,9,12,15, ...
than at
GQ=1,2, 4,5, 7,8, 10,11 , 13,14

This unequal distribution in itself is surprising, but not an issue.
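
For anyone who wants to check the same thing on their own calls, here is a minimal sketch of the tally, assuming the GQ values have already been pulled out of the VCF into a plain list. The function name and the example values are hypothetical and made up for illustration; they are not data from my run.

```python
from collections import Counter


def gq_residue_counts(gq_values):
    """Count GQ values by remainder modulo 3 (remainder 0 = multiple of 3)."""
    return Counter(gq % 3 for gq in gq_values)


# Made-up GQ values for illustration only.
example_gqs = [0, 3, 3, 6, 9, 12, 12, 15, 4, 10]
counts = gq_residue_counts(example_gqs)

# Number of GQ values with remainder 0, 1 and 2 respectively.
print(counts[0], counts[1], counts[2])
```

A large excess in the remainder-0 bucket compared with the other two is the pattern described above.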

I am noticing that the rate of concordance does NOT increase monotonically with GQ.
Those genotypes with scores that are not multiples of 3 have much worse concordance.

If I limit to multiples of 3, concordance is monotonic in GQ:
Concordance(GQ=12) > Concordance(GQ=9) > Concordance(GQ=6)

But systematically, the non-multiples of 3 have worse concordance, by orders of magnitude,
e.g. Concordance(GQ=10 or 11) < Concordance(GQ=3)
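
To be explicit about what I mean by concordance per GQ, here is a minimal sketch of the calculation, assuming each site has been reduced to a (GQ, called genotype, truth genotype) tuple. The function name and the records below are hypothetical placeholders, not my actual data or GATK code.

```python
from collections import defaultdict


def concordance_by_gq(records):
    """Return {gq: fraction of sites where the called genotype equals truth}.

    `records` is an iterable of (gq, called_genotype, truth_genotype) tuples.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for gq, called, truth in records:
        totals[gq] += 1
        if called == truth:
            matches[gq] += 1
    return {gq: matches[gq] / totals[gq] for gq in totals}


# Made-up records for illustration only.
example_records = [
    (3, "0/1", "0/1"),
    (3, "0/1", "0/0"),
    (6, "0/1", "0/1"),
    (10, "0/1", "0/0"),
]
rates = concordance_by_gq(example_records)
for gq in sorted(rates):
    # Label each GQ by whether it is a multiple of 3.
    kind = "multiple of 3" if gq % 3 == 0 else "not a multiple of 3"
    print(gq, kind, rates[gq])
```

Sorting the result by GQ and splitting it on `gq % 3 == 0` is enough to see whether concordance is monotonic within the multiples of 3 and how the other scores compare.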

Why is that?
Trying to look at the source code for PairHMM (where the likelihoods are computed),
I found a "TRISTATE" correction (value 3.0) that is applied differently to reads with "N",
but I don't see any "N" in the alignments at those non-multiple-of-3 sites.

I looked at the "raw" BAM file (not the output of HaplotypeCaller) for
a number of examples with scores from 5 to 22 and coverage from 2 to 14.

None has pathological features:
- no N in the alignment
- only one example overlapped soft-clipped bases
- usually only 1 read supporting the alternate allele (or one read supporting the reference)
- usually all the reads have 100M CIGARs
- the variant is not next to the end of a soft-clipped read

Is that a bug in the PairHMM scoring?

p.s. The BAM has 32-offset qualities.

Issue · Github (by Sheila)
Issue Number: 545
State: closed
Closed By: vdauwera
