Using HaplotypeCaller to detect low-level mosaics
We are interested in calling variants on a sample that may contain a small amount of mosaicism. For instance, if there are 300 reads over a particular base, and six of those reads contain an SNV or InDel, then we would like a call to be made. Currently, the six reads are filtered out by HaplotypeCaller, and do not even appear in the GVCF output.
The decision by HaplotypeCaller appears to be based on the reasonable assumption that a site in a diploid organism must be homozygous reference, heterozygous with a 50:50 allele ratio, or homozygous alternate. The probability of a site being in each of these three states can then be calculated from the sequencing error rate, as estimated by the base quality scores (for the homozygous options), and the binomial distribution (for the heterozygous option). Under these assumptions, an allele depth of 294,6 has a very low probability for all three options: it does not seem reasonable to have six errors in the same place, and it also does not seem reasonable to obtain an imbalance of 294,6 from random sampling with probability 0.5.
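To make the argument concrete, here is a rough sketch of that three-state calculation. It is our simplified reading of the model, not HaplotypeCaller's actual code. It assumes a single per-base error rate of 0.001 (roughly Q30) for every read, and the function names are made up for illustration.

```python
import math

def log10_binom_pmf(k, n, p):
    """log10 of the binomial probability of k alt reads out of n, given alt probability p."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return (log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)) / math.log(10)

def genotype_log10_likelihoods(ref_reads, alt_reads, error_rate=0.001):
    """Simplified diploid model: an alt read is seen with probability
    error_rate (hom-ref), 0.5 (het) or 1 - error_rate (hom-alt).
    error_rate=0.001 is an assumed flat value, not taken from real data."""
    n = ref_reads + alt_reads
    return {
        "hom_ref": log10_binom_pmf(alt_reads, n, error_rate),
        "het": log10_binom_pmf(alt_reads, n, 0.5),
        "hom_alt": log10_binom_pmf(alt_reads, n, 1 - error_rate),
    }

if __name__ == "__main__":
    for genotype, ll in genotype_log10_likelihoods(294, 6).items():
        print(f"{genotype}: log10 likelihood = {ll:.2f}")
```

Under these assumptions all three likelihoods are small, but homozygous reference is by far the least unlikely, with the heterozygous state many orders of magnitude behind it. That would be consistent with the site being treated as reference and the six reads being ignored.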
We would like it very much if there were an option to relax the probability calculation for heterozygous variants, so that the allele balance is no longer assumed to be 50:50. This would increase the chance that a variant is called when there is a very strong imbalance, as a 294,6 allele depth would no longer be a low-probability event.
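As an illustration of what we mean, the sketch below replaces the fixed 0.5 with a free alt-read probability and evaluates the binomial likelihood at its best-fitting value (the observed alt fraction, clamped so that it never drops below the error rate). This is only a toy version of the idea. The function names and the flat error rate of 0.001 are our own assumptions, and nothing here corresponds to an existing HaplotypeCaller option.

```python
import math

def log10_binom_pmf(k, n, p):
    """log10 of the binomial probability of k alt reads out of n, given alt probability p."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return (log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)) / math.log(10)

def free_fraction_log10_likelihood(ref_reads, alt_reads, error_rate=0.001):
    """Hypothetical 'mosaic' state: the alt-read probability is not fixed at 0.5.
    Returns (best-fitting alt probability, log10 likelihood at that value)."""
    n = ref_reads + alt_reads
    # Clamp to [error_rate, 1 - error_rate] so that 0 or n alt reads do not give log(0).
    p_hat = min(max(alt_reads / n, error_rate), 1 - error_rate)
    return p_hat, log10_binom_pmf(alt_reads, n, p_hat)

if __name__ == "__main__":
    p_hat, mosaic_ll = free_fraction_log10_likelihood(294, 6)
    hom_ref_ll = log10_binom_pmf(6, 300, 0.001)  # same assumed error rate
    print(f"best-fitting alt fraction: {p_hat:.3f}")
    print(f"mosaic log10 likelihood:  {mosaic_ll:.2f}")
    print(f"hom-ref log10 likelihood: {hom_ref_ll:.2f}")
```

With the balance left free, a 294,6 site is explained several orders of magnitude better than by homozygous reference. Note that a state with a free parameter can never fit worse than homozygous reference, so a real implementation would presumably need a prior or penalty on the allele fraction to avoid calling every sequencing error a mosaic variant.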
For extra bonus points, HaplotypeCaller could flag in the VCF file when a variant is more likely to be due to mosaicism than to 50:50 heterozygosity.