g.vcf masks variants with conflicted reads
I'm working on Salmonella, and one of the major sources of noise in our analysis is that there are some regions of the genome where we have many reads that disagree with the eventual call at that position, like this:
Our working hypothesis is that this is due to a duplication in our sample, getting mapped to a single position in the reference. Whether the final call comes out as GT=0 or GT=1 depends on the randomness of which of the duplicates happens to be sequenced with greater depth.
However, when using g.vcf files, positions where the REF allele has the most reads become part of a homref block, and the fact that any reads disagree with that call is lost.
We also see this in the final vcf file, where we found 15831 cases where a GT=1 call had conflicting reads, and only 371 cases where a GT=0 call had conflicting reads.
I was wondering whether this was a known problem, and if someone has advice on how to deal with this.