
Position of Indel event based on the REF

rfuentes (Philippines, Member)

Hi,
How can I tell from the REF and ALT columns of a VCF file the actual position where an indel event happened?
I usually see events that occur at the 2nd base. For example:
REF ALT
GGCGTGGCGT G,GCGCGTGGCGT --deletion; insertion of "C"
ATTT A,ATT --deletions

But the VCF format documentation (http://samtools.github.io/hts-specs/VCFv4.2.pdf) allows a different case:
GTC G,GTCT --deletion; insertion of "T" at the end

How does UnifiedGenotyper format indels? Does it differ from the format used in the 1000 Genomes Project? Thank you!

Roven

Answers

  • tommycarstensen ✭✭✭ (United Kingdom, Member)

    Hi @rfuentes. Someone else asked about this recently, and Geraldine or Sheila gave a really good answer, but I can't find the thread. The short answer is that you can't know which position was deleted. To the best of my knowledge, GATK left-aligns and trims indels by default.
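
A minimal sketch of why the deleted position cannot be recovered inside a repeat, using the ATTT example from the question. The helper `delete_bases` is a hypothetical name made up for this illustration and is not part of GATK.

```python
def delete_bases(seq, start, length):
    """Return seq with `length` bases removed starting at 0-based index `start`."""
    return seq[:start] + seq[start + length:]

ref = "ATTT"
# Delete one T at each possible offset after the leading A.
alts = {offset: delete_bases(ref, offset, 1) for offset in range(1, len(ref))}
print(alts)  # every offset gives the same ALT allele, "ATT"
```

Because all three deletions produce the same ALT string, the record alone cannot say which T was lost. Left alignment is a convention that reports the leftmost of these equivalent placements.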

  • rfuentes (Philippines, Member)
    edited February 2015

    @tommycarstensen @Sheila

    I found this http://gatkforums.broadinstitute.org/discussion/5020/location-of-variant-in-multi-sample-calling#latest
    But does that mean the insertion/deletion can happen at any position relative to the REF string? For example:
    REF ALT
    ATTTG A,ATAATTG
    GGA G,GGACC

    Or is it always after the 1st base because of left alignment?
    CCG C,CAACG

    I'm trying to derive the actual deleted or inserted string.
    Thank you!

  • tommycarstensen ✭✭✭ (United Kingdom, Member)

    Yes, that's exactly the answer from Sheila I had in mind. Thanks for locating it.

    I think you forgot to include positions for those variants you posted. I'm not sure I understand your question in its current format.

    This thread on left alignment of indels might also be of interest to you.

    UnifiedGenotyper does not always seem to left align indels, but I think HaplotypeCaller always does. I could be wrong.

  • rfuentes (Philippines, Member)

    @tommycarstensen

    Ah, sorry. These are not real indels from a VCF file. I just made them up for illustration.
    Most of the indels I see occur right after the first base:
    pos REF ALT
    40 CCG C,CAACG --deletion and insertion after "C" (pos 40)
    But is it possible for them to occur like these?
    10 ATTTG A,ATAATTG --insertion after pos 11
    76 GGA G,GGACC --insertion after pos 78

    I just need to know where the insertion or deletion begins so I can extract the substring. Thank you!
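
One possible way to extract the substring, sketched under stated assumptions: each ALT allele is handled separately (split the ALT column on commas first), alleles are plain base strings (symbolic alleles such as `<DEL>` or `*` are not handled), and the shared suffix is trimmed before the shared prefix so that the event is reported at its leftmost equivalent placement. The function name `describe_indel` is hypothetical; this is not how GATK itself does it.

```python
def describe_indel(pos, ref, alt):
    """Return (kind, anchor_pos, deleted, inserted) for one REF/ALT pair.

    anchor_pos is the 1-based position of the last unchanged base
    before the event, assuming `pos` is the VCF POS of the record.
    """
    max_trim = min(len(ref), len(alt))

    # Trim the shared suffix first, always keeping the first base in place.
    suffix = 0
    while suffix < max_trim - 1 and ref[-1 - suffix] == alt[-1 - suffix]:
        suffix += 1

    # Then trim the shared prefix from what is left.
    prefix = 0
    while prefix < max_trim - suffix and ref[prefix] == alt[prefix]:
        prefix += 1

    deleted = ref[prefix:len(ref) - suffix]
    inserted = alt[prefix:len(alt) - suffix]

    if deleted and not inserted:
        kind = "deletion"
    elif inserted and not deleted:
        kind = "insertion"
    else:
        kind = "substitution_or_complex"
    return kind, pos + prefix - 1, deleted, inserted


# The made-up records from this thread, one ALT allele at a time.
for pos, ref, alts in [(40, "CCG", "C,CAACG"),
                       (10, "ATTTG", "A,ATAATTG"),
                       (76, "GGA", "G,GGACC")]:
    for alt in alts.split(","):
        print(pos, ref, alt, describe_indel(pos, ref, alt))
```

For these records the sketch reports "AA" inserted after position 40 for CCG/CAACG, "AA" inserted after position 11 for ATTTG/ATAATTG, and "CC" inserted after position 78 for GGA/GGACC. In a multi-allelic record REF has to span the longest deletion, so an individual allele can share more than one leading base with REF, and its event need not sit right after the first base.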
