To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at

g.vcf masks variants with conflicted reads

I'm working on Salmonella, and one of the major sources of noise in our analysis is that there are some regions of the genome where we have many reads that disagree with the eventual call at that position, like this:


Our working hypothesis is that this is due to a duplication in our sample, getting mapped to a single position in the reference. Whether the final call comes out as GT=0 or GT=1 depends on the randomness of which of the duplicates happens to be sequenced with greater depth.

However, when using g.vcf files, positions where the REF allele has the most reads become part of a homref block, and the fact that any reads disagree with that call is lost.

We also see this in the final vcf file, where we found 15831 cases where a GT=1 call had conflicting reads, and only 371 cases where a GT=0 call had conflicting reads.

I was wondering whether this was a known problem, and if someone has advice on how to deal with this.

Best Answer


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Hi @Redmar_van_den_Berg, I'm surprised the call comes out with such a high GQ given the even split in the ADs. I would have expected that to result in a lower GQ, which would typically lead to having a single record for that site and therefore a more informative (because uncertain) ref call for the position. Can you tell me what version you're using?
  • First off, I should add that the vcf entries I've included are all from the final vcf file, not intermediate g.vcf files.

    My gatk version:

    $ gatk --version

    I've had a look at some more positions, and GQ:99 shows up almost always, even when there a different numbers of conflicting reads. Only the last call here has a lower GQ, presumably due to the conflicting reads.


    I should also add that at first glance GQ:99 appears to show up for both HaplotypeCaller and GenotypeGVCFs when there are conflicting reads. It's just that GenotypeGVCFs hide the conflicting reads when the majority match the reference.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    Those PLs show high confidence in the genotypes, which makes it even more interesting. Can you post some original BAM files and bamout files of some questionable sites?


  • @Sheila
    I'm sorry for my late response, I somehow missed your reply.

    I've attached a zip file with the bam, bamout, g.vcf and final vcf file for 5 samples that we expect would be identical. I've filtered the final vcf file for conflicting reads (##FILTER=<ID=G_readfrac0.9,), please let me know if you need anything else.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    Yes, thank you. I will have a look soon and talk to the team.


    Issue · Github
    by Sheila

    Issue Number
    Last Updated
    Closed By
Sign In or Register to comment.