Improvements in Germline Short Variants Calling Part 2
A new QUAL score model means a faster, more accurate pipeline
The short variants pipeline features a new QUAL score model, which has been around for a while but is now the default in GATK 4.1. The QUAL score of an allele is the probability that at least one sample’s genotype contains at least one copy of that allele. Previously, HaplotypeCaller and GenotypeGVCFs used an exhaustive calculation that explicitly considered the probabilities of zero, one, two copies of the allele. . . all the way up to every sample being homozygous for that allele. This approach was slow -- quadratic in the number of samples. Worse, it was exponentially slower for multiallelic sites, which are common in modern studies like ExAC and gnomAD. In practice the QUAL score relied on dodgy heuristics for multiallelics to keep runtime reasonable, which had unfortunate consequences like two good alleles cancelling each other’s evidence and rare alleles being lost at sites with more common alleles.
The new QUAL model is very similar to the somatic likelihoods model that underlies Mutect2. The idea is that each allele has some unknown frequency in the population and that genotypes are generated by sampling alleles independently (that is, assuming Hardy-Weinberg equilibrium) from these frequencies. Combining these elements with the genotype likelihoods from HaplotypeCaller gives us a probabilistic graphical model that we can solve using the technique of mean-field variational Bayesian inference. In Mutect2 tumor allele fractions replace population allele frequencies and reads replace genotypes but otherwise it’s the same model. For users joint calling many samples, samples with high ploidy, or highly multi-allelic samples, the new QUAL model will dramatically decrease the GenotypeGVCFs memory usage, and everyone will benefit from its increased accuracy.
HaplotypeCaller bug fixes for any cohort size
An important HaplotypeCaller bug fix in the 4.1 release was prompted by a comparatively humble call set of 200 samples. An increased prevalence of ”no call” genotypes for samples that were clearly homozygous reference was biasing their allele frequency estimates for an important pathogenic mutation in their key disease gene. In GVCF mode, after HaplotypeCaller genotypes variant sites, it calculates a “reference confidence” for all other positions in the active region. The reference confidence likelihoods take the less confident of the likelihood of an uncalled SNP or an uncalled indel. Uncalled SNPs are modeled using the base qualities as a representation of the probability of a sequencing error at a variant base that just happened to match the reference. The indel model is more complicated. Reads are considered to be informative for the reference at a given position if the alignment based on the HaplotypeCaller assembly has fewer mismatches than any alternate alignment containing an indel of 10bp or fewer at that position. For example, at the edge of a CAAA repeat (as shown below), a read spanning the repeat and ending in unique sequence on the other side (such as the bottom two reads) will have more mismatches if an indel is introduced and is thus informative for reference likelihood calculation. Reads that end inside a repeat (such as the top two) could potentially have the same or fewer mismatches if an indel is introduced (e.g. at the position marked with dashed lines) and are thusly not informative with respect to the reference likelihoods. This heuristic works reasonably well in practice, but led to biased reference likelihoods near indels. In the old code, the read CIGAR was not being accounted for, so reads containing indels (like the two on the bottom) had a lot of mismatches in their original alignment. The hypothetical insertion of an indel (like the 4 bp deletion in the example) would then reduce the number of mismatches and make that a better alignment than the original, in which case the read cannot be said to support the reference. For the reads in the disease samples of interest, a homozygous deletion caused all samples to have no informative reads positions surrounding the indel (depending on the reference context) because of the CIGAR bug, leading to many GQ0 calls and allele frequency overestimates due to incorrect genotypes. Since making the change respecting the reads’ CIGARs, these no-call genotypes are fixed and reference confidences around indels are much more consistent. Cohorts of any size can benefit from improved reference confidences.
Figure caption: The indel reference confidence model becomes more important in repetitive regions.
The top two reads are considered uninformative at the position indicated with the dashed lines. The first read has one mismatch (highlighted in green by IGV). If a one base insertion is added at that position, there will still be a single mismatch. Since that’s not worse than the reference alignment, we can’t say with confidence that there isn’t a 1bp insertion, so we don’t count that read towards the indel reference confidence. However the bottom two reads are informative. If a one base insertion is added, the TGTT sequence will get shifted one to the right, resulting in eight additional mismatches. Since the reference alignment has no mismatches, we can be confident that that alignment is more likely than a 1bp insertion. Mismatches are evaluated similarly for 1 to 10 base pair insertions and deletions for each read at each position.