Incorrect Likelihood / AD Count For Variant
I am trying to figure out why a site is being called with the wrong AD / Likelihood and what can be done about it.
In particular, one site in my VCF is being called with a very wonky AD value of 190,63, which is well outside the expected 50:50 ratio. The call in question is a deletion of 2 T's from a run of 7 T's (e.g. T -> T), and was made on a BAM file from a single subject using HaplotypeCaller with
--genotyping_mode DISCOVERY and otherwise the default values.
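To put a number on how far 190,63 is from 50:50, here is a quick sketch. It assumes each read independently supports either allele with probability 0.5, which is a simplification that ignores mapping and reference bias, and the variable names are mine rather than anything from GATK.

```python
import math

# AD values reported at the site: reference count, alternate count.
ref_count, alt_count = 190, 63
depth = ref_count + alt_count

# Observed fraction of reads supporting the alternate allele.
alt_fraction = alt_count / depth

# Under a simple binomial model with p = 0.5 (an assumption, not GATK's model),
# the expected alt count is depth / 2 with standard deviation sqrt(depth * 0.25).
expected_alt = depth * 0.5
std_dev = math.sqrt(depth * 0.5 * 0.5)
z_score = (alt_count - expected_alt) / std_dev

print(f"alt fraction = {alt_fraction:.3f}, z = {z_score:.2f}")
```

Only about a quarter of the reads are counted toward the deletion, which is many standard deviations away from what this simple model expects.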
In tracking down this issue, it seems the algorithm has a strong bias for the reference. Stepping through the code, my impression is the HC is doing the following:
1 - Creating a set of 82 possible haplotypes
2 - Creating a candidate set of variants at each position (in this case, GTTTT*, GTT, GTTTTT, G)
3 - Assigning each haplotype to one of the possible variants, taking the highest "score" among the haplotypes assigned to a variant, and making that the likelihood for that read/variant combination.
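As a rough illustration of step 3 as I understand it, here is a minimal sketch. It is not GATK code: the function name, the haplotype IDs, and the likelihood values are all made up, and the "take the max" rule is only my impression from stepping through the code.

```python
def marginalize_over_alleles(read_hap_loglik, hap_to_allele):
    """Collapse read-vs-haplotype scores into read-vs-allele scores.

    read_hap_loglik: dict of read -> dict of haplotype -> log10 likelihood.
    hap_to_allele:   dict of haplotype -> allele it was assigned to.
    Returns dict of read -> dict of allele -> best log10 likelihood
    among the haplotypes assigned to that allele.
    """
    result = {}
    for read, per_hap in read_hap_loglik.items():
        best = {}
        for hap, loglik in per_hap.items():
            allele = hap_to_allele[hap]
            # Keep the highest score seen so far for this allele.
            if allele not in best or loglik > best[allele]:
                best[allele] = loglik
        result[read] = best
    return result


# Hypothetical toy data: three haplotypes, two alleles from the candidate set.
hap_to_allele = {"hap1": "GTTTT", "hap2": "GTTTT", "hap3": "GTT"}
read_hap_loglik = {
    "read1": {"hap1": -5.0, "hap2": -1.2, "hap3": -9.0},
    "read2": {"hap1": -8.5, "hap2": -7.0, "hap3": -0.4},
}

read_allele_loglik = marginalize_over_alleles(read_hap_loglik, hap_to_allele)
print(read_allele_loglik)
```

The point is that the result depends entirely on `hap_to_allele`: if a haplotype is assigned to the wrong allele, its score is credited to the wrong allele for every read.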
However, this process seems to be flawed. Examining the region, it looks like the process that assigns haplotypes to variants (Method [email protected]:1043) assumes that if a haplotype doesn't have an event at a particular location, it should be assigned to the reference.
This, however, does not work when the haplotype does not match the reference at that location, which can occur, for example, if an upstream deletion removes the reference sequence being considered. The graphic below shows all the haplotypes that were considered for each possible variant. Clearly, many haplotypes carrying the "TT" deletion are identified as reference, which seems to lead to skewed likelihoods and incorrect AD counts further downstream.
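To make the suspected failure mode concrete, here is a toy model. It is not the GATK implementation: the `Event` type, both functions, the coordinates, and the "SPANNED" label are hypothetical and only illustrate the behaviour described above.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    start: int  # reference position where the event begins
    ref: str    # reference bases replaced by the event
    alt: str    # bases the haplotype carries instead

    @property
    def end(self) -> int:
        # Exclusive end of the reference span covered by this event.
        return self.start + len(self.ref)


def assign_naive(events: List[Event], locus: int) -> str:
    """No event starting at the locus means the haplotype counts as reference."""
    for ev in events:
        if ev.start == locus:
            return ev.alt
    return "REF"


def assign_span_aware(events: List[Event], locus: int) -> str:
    """Same rule, but flag haplotypes whose upstream event covers the locus."""
    for ev in events:
        if ev.start == locus:
            return ev.alt
    for ev in events:
        if ev.start < locus < ev.end:
            return "SPANNED"
    return "REF"


# Hypothetical haplotype: a deletion starting at 98 removes reference bases 99-101.
upstream_deletion = [Event(start=98, ref="AGTT", alt="A")]
locus = 100

print(assign_naive(upstream_deletion, locus))
print(assign_span_aware(upstream_deletion, locus))
```

In this toy case the naive rule labels the haplotype as reference at position 100 even though the reference bases there have been deleted, while the span-aware check keeps it out of the reference group. Whether something like the second check would be the right fix in HaplotypeCaller is an open question on my side.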
Is this a known issue with the haplotype caller? And if so, are there any workarounds?