I'm working with some previously sequenced and analyzed data, and I'm curious whether there is a bias toward the reference base. Specifically, about 3 years ago, our group sequenced ~40 genomes of C. albicans patient isolates. These isolates were sequenced at relatively low coverage (~25x), and validation of variant sites with Sequenom revealed some problems in SNP calling.

My question is this: does GATK (specifically older versions, from about 2011) default to the reference base, or prefer the reference base, in regions of low coverage? More generally, are there any circumstances under which GATK would be biased towards reporting the reference base as opposed to either (1) a variant base or (2) an indeterminate base?

