Determining ploidy via read ratios with HaplotypeCaller: skewed results?
I am attempting to determine the ploidy of my organism by observing where the modes of SNP allele ratios fall. For example, a 2N diploid should have a peak of allele read-counts at 0.5 since most SNPs should have, for example, 15 reference and 15 alternate reads. A triploid should have two modal peaks around 0.33 and 0.66, and a tetraploid should have four peaks around 0.25, 0.5, and 0.75.
I am using HaplotypeCaller to call SNPs on a segregating cross in gVCF+joint genotyping mode, then using SelectVariants to keep biallelic SNPs and using the JEXL expression " -select 'vc.getGenotype("sample").isHet()' " to isolate only those SNPs called as heterozygous, then taking the "-GF AD" option on SelectVariants and parsing the allele depths in the resulting table, and finally calculating the alternate allele read depth divided by the total read depth to get my final "ratio".
While this workflow seems to work and give a modal peak around 0.5, every sample previously known as a 2N diploid I try this on shows a significant shoulder on the left side of the density plot I generate (see attached). When I apply this to an artificial tetraploid (two 2N diploid individuals merged) I can see the expected three modal peaks around 0.25, 0.5, and 0.75, but again, there is a very significant bias toward the left.
My question is: Is this normal for HaplotypeCaller? Is there a particular reason why it is calling SNPs heterozygous only on one end of the spectrum (keeping SNPs with a low alt/total ratio, but not a high)? Or should I try a different SNP calling tool?