Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

g.vcf masks variants with conflicted reads

I'm working on Salmonella, and one of the major sources of noise in our analysis is that there are some regions of the genome where we have many reads that disagree with the eventual call at that position, like this:


Our working hypothesis is that this is due to a duplication in our sample, getting mapped to a single position in the reference. Whether the final call comes out as GT=0 or GT=1 depends on the randomness of which of the duplicates happens to be sequenced with greater depth.

However, when using g.vcf files, positions where the REF allele has the most reads become part of a homref block, and the fact that any reads disagree with that call is lost.

We also see this in the final vcf file, where we found 15831 cases where a GT=1 call had conflicting reads, and only 371 cases where a GT=0 call had conflicting reads.

I was wondering whether this was a known problem, and if someone has advice on how to deal with this.

Best Answer


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Hi @Redmar_van_den_Berg, I'm surprised the call comes out with such a high GQ given the even split in the ADs. I would have expected that to result in a lower GQ, which would typically lead to having a single record for that site and therefore a more informative (because uncertain) ref call for the position. Can you tell me what version you're using?
  • First off, I should add that the vcf entries I've included are all from the final vcf file, not intermediate g.vcf files.

    My gatk version:

    $ gatk --version

    I've had a look at some more positions, and GQ:99 shows up almost always, even when there a different numbers of conflicting reads. Only the last call here has a lower GQ, presumably due to the conflicting reads.


    I should also add that at first glance GQ:99 appears to show up for both HaplotypeCaller and GenotypeGVCFs when there are conflicting reads. It's just that GenotypeGVCFs hide the conflicting reads when the majority match the reference.

  • SheilaSheila Broad InstituteMember, Broadie admin


    Those PLs show high confidence in the genotypes, which makes it even more interesting. Can you post some original BAM files and bamout files of some questionable sites?


  • @Sheila
    I'm sorry for my late response, I somehow missed your reply.

    I've attached a zip file with the bam, bamout, g.vcf and final vcf file for 5 samples that we expect would be identical. I've filtered the final vcf file for conflicting reads (##FILTER=<ID=G_readfrac0.9,), please let me know if you need anything else.

  • SheilaSheila Broad InstituteMember, Broadie admin


    Yes, thank you. I will have a look soon and talk to the team.


    Issue · Github
    by Sheila

    Issue Number
    Last Updated
    Closed By
Sign In or Register to comment.