Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GATK HaplotypeCaller MNP output problem.

SkyWarriorSkyWarrior TurkeyMember ✭✭✭
edited December 2018 in Ask the GATK team

Hi
I have a situation here where a clear MNP is not literally emitted by the HaplotypeCaller engine.

The top VCF is where MNP distance is 0 and the bottom is 5. Looks like GATK HC engine is too INDEL centric for this kind of situations where it prefers emitting a GC deletion and an AA insertion rather than TGC - AAT substitution. I am wondering if this could be related to the pairHMM and its limitations in certain transitions (Not allowing 1 transversion and 2 transitions at the same time at the expense of introducing gaps in both sequences) types. And if that's the case I am also wondering if an alternative engine may also be implemented in GATK that does not rely on pairHMM only to detect sequence differences.

I can submit snippets of this if you would like to check.

Thanks.

Best Answers

  • SkyWarriorSkyWarrior Turkey ✭✭✭
    Accepted Answer

    Thanks for the detailed response. Second paragraph was exactly my thoughts so it is confirmed. I am not too agitated to get this MNP by default since seeing a weird pattern in VCF already tells me to look there manually.

    I will look forward to hearing any developments in this matter.

    Thanks again.

Answers

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @SkyWarrior What version are you using? An improvement from 4.0.5.2 -- we realize that our default Smith-Waterman parameters for aligning reads to their best haplotype was taken blindly from evolutionary models of sequence divergence, which have nothing to do with the error probabilities of sequencers -- might address this Previously the engine was sometimes favoring indels over SNPs.

    Just to be clear, are your vcfs generated by HaplotypeCaller with --max-mnp-distance 0 (default) and --max-mnp-distance 5?

    If the issue remains, could you post an IGV image of the bamout with --max-mnp-distance 5 here? After that we can proceed to examining bam snippets if we need to.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited December 2018

    This was called by GATK 4.0.11.0.

    here are the bamout images.

    Top is the --max-mnp-distance 0 and bottom is 5.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    Apologies for not saying this earlier: could you also show all the artificial haplotypes in the bamout?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Sure here they are.

    These 2 bamouts were generated with ALL_POSSIBLE_HAPLOTYPES option. They go with the same order.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @SkyWarrior I don't have an answer yet. Could you post a bam snippet with a padding of 1000 bp on either side?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Sure I will.

    This is mapped to hs37d5 from 1000G.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    Accepted Answer

    Thanks for the detailed response. Second paragraph was exactly my thoughts so it is confirmed. I am not too agitated to get this MNP by default since seeing a weird pattern in VCF already tells me to look there manually.

    I will look forward to hearing any developments in this matter.

    Thanks again.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited March 30

    Hi I am back again with another clearly misrepresented haplotype and looks like it propagated into many databases already like that.

    This haplotype includes a 4 bp insertion and 2 SNPs however gatk HaplotypeCaller all the way from 3.7 to 4.1.1.0 prefers to represent it by 2 2bp insertions and 1 SNP.

    To me both looks ok however 4bp insertion and 2 SNPs looks more conventional and the indel is more left aligned in that case.

    What do you think? -150 mismatch penalty and -11 gap extension penalty seems to be acting weird here again.

    Top VCF is gatk 4.1.1.0 HC and Bottom vcf is 3.8.1 UnifiedGenotyper

    CRAM and Bamout are also together. This is not the only sample showing the behavior. Heterozygous and homozygous samples all show the same.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @SkyWarrior

    We are looking into this and will get back to you shortly.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @SkyWarrior

    David's comments earlier on this thread still applies, the penalty values we use for aligning haplotypes to the reference give a perhaps undesirable result in this case. Although there are cases (such as this one) where the results from using the current parameters aren't the best, it's not clear exactly what a "better" set of parameters would be. That is, a set of parameters that perhaps treats this case better, will lead to less desirable representations in other cases. Because we don't have a clear better option, the inclination currently is to leave the implementation as is for the time being.

Sign In or Register to comment.