Variant types confusion

santiagorevale (Argentina, Member)

Dear GATK team,

I'm a bit confused about the term MIXED (and maybe SYMBOLIC), because I believe different software packages use it differently.
If I understand correctly from the FAQ "What types of variants can GATK tools handle?" we have:

  • MIXED (combination of SNPs and indels at a single position)
    E.g. Reference = 'T', Sample = 'A,TCC'
    Here, we say it's MIXED because it combines 2 variant types (SNP, INS) for this position; we are talking about two possible alleles.

  • SYMBOLIC (generally, a very large allele or one that's fuzzy and not fully modeled; i.e. there's some event going on here but we don't know what exactly)
    E.g. Reference = 'GC', Sample = 'TTA'
    Is this example correctly classified for what SYMBOLIC stands for?

On the other hand, I've been using SnpSift (from the SnpEff package) to filter variants, but when I tried to select what I understood to be MIXED variants, I got a different result than with GATK. While checking its manual, I found what seems to be a different definition of MIXED:

  • MIXED: Multiple-nucleotide and an InDel.
    E.g. Reference = 'ATA', Sample = 'GTCAGT'

I believe SnpEff's definition of the MIXED variant type is equivalent to GATK's definition of SYMBOLIC, am I right?

I've been told that a) a "MIXED variant" is one thing and b) a "MIXED variant call record" is another. GATK uses MIXED in sense b), while SnpEff uses it in sense a).

Is there an official definition of these terms? Is either of these tools wrong?

Thank you very much for your help.

Sincerely,
Santiago

Answers

  • Sheila (Broad Institute; Member, Broadie admin)

    @santiagorevale
    Hi Santiago,

    I tried it out, and it looks like GATK follows the VCF spec format. Have a look at the spec for more information: http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variant-call-format-version-41

    -Sheila

  • tommycarstensen (United Kingdom, Member ✭✭✭)

    @santiagorevale just a small comment: UG (UnifiedGenotyper) emits SNPs and indels at the same position as separate records, whereas HC (HaplotypeCaller) emits them as one VCF record. There are many tools to split your multiallelic sites into biallelic sites.
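
    To illustrate what splitting a multiallelic site means, here is a minimal Python sketch. The `Record` type and the `split_multiallelic` function are hypothetical names made up for this example, not part of GATK or any other tool. Real splitters also trim and left-align alleles and rewrite the genotype and INFO fields, which this sketch ignores.

```python
from typing import Iterator, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical, stripped-down stand-in for a VCF record (site fields only)."""
    chrom: str
    pos: int
    ref: str
    alts: Tuple[str, ...]


def split_multiallelic(record: Record) -> Iterator[Record]:
    """Yield one biallelic record per ALT allele of the input record."""
    for alt in record.alts:
        yield Record(record.chrom, record.pos, record.ref, (alt,))


# The MIXED example from the question: REF = T, ALT = A,TCC
site = Record("chr1", 1000, "T", ("A", "TCC"))
for biallelic in split_multiallelic(site):
    print(biallelic.chrom, biallelic.pos, biallelic.ref, ",".join(biallelic.alts))
# chr1 1000 T A
# chr1 1000 T TCC
```

    After splitting, each record carries a single ALT allele, so a record that was MIXED only because it combined a SNP with an insertion becomes one SNP record and one indel record.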

  • santiagorevale (Argentina, Member)

    Hi Sheila,

    I'm still confused.

    When you talk about alleles in the above-mentioned FAQ, what's the difference between MIXED and SYMBOLIC? Could you give me an example?

    Are these two examples correct?

    • MIXED: e.g. Reference = 'T', Sample = 'A,TCC'
    • SYMBOLIC: e.g. Reference = 'GC', Sample = 'TTA'

    Thanks in advance.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    edited July 2015

    The first is correct. The second is not -- it's an MNP or complex substitution. A symbolic allele would be something like * or <NON_REF>, where the allele is not an actual representation of nucleotides, but instead a symbol that represents an allele that is only partly determined, if at all.
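
    The per-allele versus per-record distinction can be sketched in a few lines of Python. This is a simplified illustration, not GATK's actual code: the function names are hypothetical, treating `*` as symbolic follows the description in this thread, and calling every unequal-length REF/ALT pair an INDEL is an assumption of the sketch. How a given tool labels a complex substitution such as GC to TTA should be checked against the tool itself.

```python
from typing import Sequence


def classify_allele(ref: str, alt: str) -> str:
    """Classify a single ALT allele by pairwise comparison with REF."""
    # Symbolic alleles are placeholders such as "<NON_REF>" or "*",
    # not literal nucleotide sequences.
    if alt == "*" or (alt.startswith("<") and alt.endswith(">")):
        return "SYMBOLIC"
    if len(ref) == len(alt):
        # Same length: one base is a SNP, several bases an MNP.
        return "SNP" if len(ref) == 1 else "MNP"
    # Assumption of this sketch: any length change counts as an indel.
    return "INDEL"


def classify_record(ref: str, alts: Sequence[str]) -> str:
    """Classify a whole record; MIXED means the ALT alleles differ in type."""
    types = {classify_allele(ref, alt) for alt in alts}
    if not types:
        return "NO_VARIATION"
    if len(types) == 1:
        return next(iter(types))
    return "MIXED"


print(classify_record("T", ["A", "TCC"]))    # MIXED
print(classify_record("GC", ["TTA"]))        # INDEL
print(classify_record("T", ["<NON_REF>"]))   # SYMBOLIC
```

    Under these assumptions MIXED can only arise at the record level, when one site carries ALT alleles of different types. A single allele is never MIXED by itself.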

  • santiagorevale (Argentina, Member)

    Thanks, Geraldine.

    So how would GATK classify this second example? It's not an MNP (for that, REF and ALT would need the same number of nucleotides); it's more like a complex substitution.

    Is there a name for this type of complex substitution? Should it be documented in a specification? SnpEff calls this type of variant a MIXED variant, but I couldn't find any source that controls this vocabulary. Am I missing it?

    Thanks again.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hmm, it's like a mix of an MNP and an insertion (so complex substitution is an appropriate catch-all name). I think GATK would probably consider it MIXED, but I'm not sure; you'd have to test it, e.g. by running SelectVariants on it with the variant type argument.

    If anyone controls this vocabulary it's GA4GH and the hts-spec group: https://github.com/samtools/hts-specs
