HaplotypeCaller not calling variants overlapping with active region and VCF/GVCF output discrepancy
We've been running regression tests of HaplotypeCaller against previous UnifiedGenotyper output, and we found a locus where UG originally called a set of four low-allele-balance SNPs while HC produced no variant calls at all. The alignments around this region can be seen in the attached figure, with the putative SNPs on the right. Inspection of the HaplotypeCaller debug output showed that it identified a large 78 bp deletion (shown in red at the top of the figure) in a tandem near-repeat region. The few base-pair differences between the tandem repeat copies thus showed up as SNPs in the initial alignment rather than as the (real) large deletion. However, although the deletion was identified in the debug output, no call was produced. This appears to be because the deletion extends beyond the initial active region (shown as the blue bar above the alignments). If I manually supply an active region that contains the whole deletion using --activeRegionIn, then the deletion is called.
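To make the containment condition concrete, here is a minimal Python sketch of the check I believe is relevant: whether the reference span of a deletion lies entirely inside an active region. This is not GATK code; the `Interval` type, the helper names, and all coordinates are hypothetical, and only the 78 bp deletion length comes from the case above.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    contig: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def deletion_span(contig: str, pos: int, ref: str) -> Interval:
    """Reference span of a VCF-style record: POS through POS + len(REF) - 1."""
    return Interval(contig, pos, pos + len(ref) - 1)


def contains(region: Interval, span: Interval) -> bool:
    """True if `span` lies entirely within `region` on the same contig."""
    return (
        region.contig == span.contig
        and region.start <= span.start
        and span.end <= region.end
    )


# Hypothetical coordinates: a 78 bp deletion is written in VCF with a
# 79-base REF allele (one anchor base plus the 78 deleted bases).
deletion = deletion_span("chr1", 1010, "A" * 79)

initial_region = Interval("chr1", 1000, 1060)  # deletion runs past the end
manual_region = Interval("chr1", 1000, 1100)   # covers the whole deletion

print(contains(initial_region, deletion))
print(contains(manual_region, deletion))
```

With these made-up coordinates the first check fails and the second succeeds, which mirrors the behaviour described: the call only appears once the region covers the full deletion.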
The second issue is that, with the manually supplied active region, the deletion is not output when using --emitRefConfidence GVCF but is output when using --emitRefConfidence None. Some source diving and debugging showed that the variant quality for this call differs wildly between the two modes: a Phred-scaled variant quality of 4410 with --emitRefConfidence None, versus a variant quality of 0 with --emitRefConfidence GVCF. I can share the full outputs.
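For anyone wanting to look for the same discrepancy in their own data, below is a rough Python sketch that compares the QUAL column for matching sites between two sets of VCF records. It assumes plain tab-separated VCF lines and keys sites by contig, position, and REF allele (ALT is ignored because GVCF records carry an extra `<NON_REF>` allele). The records in the example are hypothetical stand-ins; only the QUAL values 4410 and 0 come from the case above.

```python
def qual_by_site(lines):
    """Map (CHROM, POS, REF) -> QUAL for every non-header VCF line."""
    quals = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, qual = fields[0], int(fields[1]), fields[3], fields[5]
        quals[(chrom, pos, ref)] = None if qual == "." else float(qual)
    return quals


def qual_discrepancies(vcf_lines, gvcf_lines, tolerance=1.0):
    """Return sites present in both inputs whose QUAL differs by more than `tolerance`."""
    vcf_quals = qual_by_site(vcf_lines)
    gvcf_quals = qual_by_site(gvcf_lines)
    differing = []
    for site in sorted(vcf_quals.keys() & gvcf_quals.keys()):
        a, b = vcf_quals[site], gvcf_quals[site]
        if a is None or b is None:
            continue  # missing QUAL on either side: nothing to compare
        if abs(a - b) > tolerance:
            differing.append((site, a, b))
    return differing


# Hypothetical records; REF is abbreviated to keep the lines short.
vcf = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
       "chr1\t1010\t.\tACGT\tA\t4410\t.\t."]
gvcf = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t1010\t.\tACGT\tA,<NON_REF>\t0\t.\t."]

for site, q_vcf, q_gvcf in qual_discrepancies(vcf, gvcf):
    print(site, q_vcf, q_gvcf)
```

Note that this only catches sites that appear in both outputs; a variant that is dropped from one output entirely, as in my case, would need a separate check for missing sites.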
We're using a custom build of GATK 3.6 in which the currently non-functional --activeRegionIn argument has been fixed. I was able to replicate the missing variant call with GATK 3.7, but I haven't tested the VCF/GVCF discrepancy in 3.7, since I don't have a locus that shows the discrepancy without also being affected by the first problem.