Heterozygous variants have unbalanced allele depth distribution
I was inspecting the frequency of the alternative allele at heterozygous variant sites in 16 exome sequencing samples. In each sample, I selected heterozygous variants with exactly one alternative allele and computed the frequency of the alt allele as AD(ALT) / (AD(REF) + AD(ALT)). Since the allele frequency estimate has high variance at low-coverage sites, I restricted the analysis to variants with AD(REF) + AD(ALT) >= 40 (though the distribution of the frequency was similar without this filter). I expected the mean and median to be around 0.5 (naturally with some variance), but they are closer to 0.45 (please see the attached boxplot). I am aware that the AD FORMAT field is the unfiltered allele depth excluding uninformative reads, but in my opinion any filtering should affect reads supporting the REF and ALT alleles in the same way. Something similar was reported in this post, without any conclusions. Do you have any explanation for this?
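For concreteness, the calculation described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not necessarily the script used for the boxplot: the function names (`alt_fraction`, `het_alt_fractions`), the `sample_index` parameter, and the toy records are all hypothetical, and the parsing assumes a plain-text, tab-separated VCF whose FORMAT column contains GT and AD.

```python
from statistics import mean, median

MIN_DEPTH = 40  # same threshold as in the post: AD(REF) + AD(ALT) >= 40


def alt_fraction(ad_ref, ad_alt):
    """Alt allele frequency as defined in the post: AD(ALT) / (AD(REF) + AD(ALT))."""
    return ad_alt / (ad_ref + ad_alt)


def het_alt_fractions(vcf_lines, sample_index=0, min_depth=MIN_DEPTH):
    """Collect alt allele fractions for biallelic heterozygous calls of one sample.

    sample_index is the 0-based position of the sample among the sample
    columns (hypothetical parameter, useful for a joint-called VCF).
    """
    fractions = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if "," in fields[4]:
            continue  # more than one alternative allele at this site
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        call = dict(zip(keys, values))
        genotype = call.get("GT", "").replace("|", "/")
        if genotype not in ("0/1", "1/0"):
            continue  # keep heterozygous REF/ALT calls only
        try:
            ad_ref, ad_alt = (int(x) for x in call.get("AD", ".").split(","))
        except ValueError:
            continue  # AD missing or not exactly two integer values
        if ad_ref + ad_alt < min_depth:
            continue  # low coverage: allele fraction too noisy
        fractions.append(alt_fraction(ad_ref, ad_alt))
    return fractions


# Toy records (made-up values, for illustration only).
TOY_LINES = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0/1:22,18:40",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:AD:DP\t0/1:10,8:18",
    "chr1\t300\t.\tG\tA,C\t50\tPASS\t.\tGT:AD:DP\t1/2:0,20,25:45",
    "chr1\t400\t.\tT\tC\t50\tPASS\t.\tGT:AD:DP\t1/1:0,50:50",
    "chr1\t500\t.\tA\tC\t50\tPASS\t.\tGT:AD:DP\t0/1:30,30:60",
]

if __name__ == "__main__":
    fracs = het_alt_fractions(TOY_LINES)
    print(len(fracs), mean(fracs), median(fracs))
```

In the toy input only the records at positions 100 and 500 pass all three filters (biallelic, heterozygous, depth of at least 40); the others are skipped for low depth, multiple alt alleles, or a homozygous genotype.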
Reads were generated on an Illumina platform with the Nextera exome kit. BWA and GATK 3.4-46 were used following the Best Practices recommendations (sorry about the old version; this is data I worked on a while back and I did not rerun the samples). Joint variant calling with HaplotypeCaller (HC) was performed on ~100 exome samples.