HaplotypeCaller: scores that are multiples of 3 indicate much better concordance than those that are n
I was comparing the concordance of genotype calls as a function of Genotype Quality (GQ) for the GATK 3.5 HaplotypeCaller
(using genotype given alleles and emitting all sites, even those with a score of 0).
My "truth" set is the genotypes from Illumina OMNI 2.5M arrays, and I am comparing the genotype calls from Illumina exome arrays against it.
I notice that the Genotype Quality (GQ) scores of the HaplotypeCaller are concentrated at intervals of 3.
e.g. many, many more scores (by orders of magnitude) at
GQ=1,2, 4,5, 7,8, 10,11 , 13,14
This unequal distribution in itself is surprising, but not an issue.
I am noticing that the rate of concordance does NOT increase monotonically with GQ.
Genotypes whose GQ is not a multiple of 3 have much worse concordance.
If I limit the comparison to GQ values that are multiples of 3, concordance is monotonic in GQ,
but systematically, the non-multiples of 3 have worse concordance by orders of magnitude,
e.g. Concordance(10 or 11) < Concordance(3)
Why is that?
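To make the comparison concrete, here is roughly what I am computing, as a minimal Python sketch. The `(gq, call, truth)` tuple layout and all function names are my own illustration (assumptions, not GATK code), and it assumes the calls have already been matched to the array genotypes by site and sample.

```python
from collections import defaultdict


def concordance_by_gq(records):
    """Return {gq: fraction of calls matching truth}.

    `records` is an iterable of (gq, called_genotype, truth_genotype)
    tuples; this layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for gq, call, truth in records:
        totals[gq] += 1
        if call == truth:
            matches[gq] += 1
    return {gq: matches[gq] / totals[gq] for gq in sorted(totals)}


def split_by_multiple_of_3(per_gq):
    """Split per-GQ concordance into (multiples of 3, non-multiples of 3)."""
    on = {gq: c for gq, c in per_gq.items() if gq % 3 == 0}
    off = {gq: c for gq, c in per_gq.items() if gq % 3 != 0}
    return on, off


def is_monotonic_non_decreasing(per_gq):
    """True if concordance never drops as GQ increases."""
    values = [per_gq[gq] for gq in sorted(per_gq)]
    return all(a <= b for a, b in zip(values, values[1:]))
```

Running `is_monotonic_non_decreasing` on the full table and then on the multiples-of-3 half from `split_by_multiple_of_3` is the check that fails on all GQ values but passes on the multiples of 3 in my data.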
So, trying to look at the source code for PairHMM (where the likelihoods are computed),
I found a "TRISTATE" correction (value 3.0) that is applied differently to reads with "N",
but I don't see "N" in the alignments at those non-multiple-of-3 sites.
I looked at the "raw" BAM file (not the output of HaplotypeCaller) for
a number of examples with scores from 5 to 22 and coverage from 2 to 14.
None has pathological features:
- no N in alignment
- only one example overlapped soft-clipped bases
- usually only 1 read supporting the alternate allele (or one read supporting the reference)
- usually all the reads have 100M CIGARs
- the variant is not next to the end of a soft-clipped read
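The per-read checks in the list above could be automated along these lines. This is only a sketch in plain Python working on SAM text fields; the function name, the returned keys, and the 5 bp `window` are my own assumptions, and `pos` is taken to be the 1-based leftmost mapped position as in the SAM POS column.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_features(seq, cigar, pos, variant_pos, window=5):
    """Flag the read features checked by eye in the list above."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips carry no bases in SEQ, so ignore them when looking at read ends.
    core = [(n, op) for n, op in ops if op != "H"]
    # Operations that consume reference positions.
    ref_len = sum(n for n, op in core if op in "MDN=X")
    ref_end = pos + ref_len - 1
    # Reference coordinates where aligned bases meet a soft clip.
    clip_edges = []
    if core and core[0][1] == "S":
        clip_edges.append(pos)
    if core and core[-1][1] == "S":
        clip_edges.append(ref_end)
    return {
        "has_n_base": "N" in seq.upper(),
        "has_cigar_n": any(op == "N" for _, op in core),
        "soft_clipped": any(op == "S" for _, op in core),
        "all_match": len(core) == 1 and core[0][1] == "M",
        "variant_near_clip": any(abs(variant_pos - e) <= window for e in clip_edges),
    }
```

For the reads I inspected, I would expect nearly all of them to come back with `all_match` true and every other flag false.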
Is that a bug in the PairHMM scoring?
p.s. The BAM has 32-offset qualities.