Biallelic vs Multiallelic sites

KateN, Cambridge, MA (Member, Broadie, Moderator)
edited December 2015 in Dictionary

A biallelic site is a specific locus in a genome that contains two observed alleles, counting the reference as one, and therefore allows for one variant allele. In practical terms, this is what you would call a site where, across multiple samples in a cohort, you have evidence for a single non-reference allele. Shown below is a toy example in which the consensus sequences for samples 1-3 have a deletion at position 7. Sample 4 matches the reference. This is considered a biallelic site because there are only two observed alleles: a deletion and the reference allele G.

           1 2 3 4 5 6 7 8 9
Reference: A T A T A T G C G
Sample 1 : A T A T A T - C G
Sample 2 : A T A T A T - C G
Sample 3 : A T A T A T - C G
Sample 4 : A T A T A T G C G
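
As a minimal sketch of the counting rule above (not part of any GATK tool), the following Python snippet tallies the distinct alleles seen in one column of the toy alignment, reference included, and labels the site accordingly. The function name `classify_site`, the "invariant" label, and the use of `-` to stand for a deletion are assumptions made for illustration.

```python
def classify_site(reference, samples, position):
    """Label a 1-based alignment column by its number of distinct alleles."""
    idx = position - 1
    # The reference allele counts as one of the observed alleles.
    alleles = {reference[idx]} | {s[idx] for s in samples}
    if len(alleles) == 1:
        return "invariant", alleles
    if len(alleles) == 2:
        return "biallelic", alleles
    return "multiallelic", alleles


# Toy alignment from the example, written without spaces.
reference = "ATATATGCG"
samples = ["ATATAT-CG", "ATATAT-CG", "ATATAT-CG", "ATATATGCG"]

label, alleles = classify_site(reference, samples, 7)
print(label, sorted(alleles))  # biallelic ['-', 'G']
```
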

A multiallelic site is a specific locus in a genome that contains three or more observed alleles, again counting the reference as one, and therefore allows for two or more variant alleles. This is what you would call a site where, across multiple samples in a cohort, you see evidence for two or more non-reference alleles. Shown below is a toy example in which the consensus sequences for samples 1-3 have a deletion or a SNP at position 7. Sample 4 matches the reference. This is considered a multiallelic site because there are four observed alleles: a deletion, the reference allele G, a C (SNP), and a T (SNP). True multiallelic sites are not observed very frequently unless you look at very large cohorts, so they are often taken as a sign of a noisy region where artifacts are likely.

           1 2 3 4 5 6 7 8 9
Reference: A T A T A T G C G
Sample 1 : A T A T A T - C G
Sample 2 : A T A T A T C C G
Sample 3 : A T A T A T T C G
Sample 4 : A T A T A T G C G
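
In a VCF file, the non-reference alleles at a site are listed comma-separated in the ALT column, so the biallelic/multiallelic distinction can be roughly checked by counting ALT entries. The sketch below assumes a tab-separated VCF data line; the record is a hypothetical encoding of the toy example (the deletion is anchored on the preceding base, as the VCF format requires for indels), and `count_alleles` is an illustrative helper, not a GATK function.

```python
def count_alleles(vcf_line):
    """Return the number of alleles (REF plus ALTs) in one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt = fields[4]  # fifth column is ALT
    if alt == ".":   # "." means no alternate allele was called
        return 1
    return 1 + len(alt.split(","))


# Hypothetical record: POS 6, REF "TG". ALT "T" is the deletion of G;
# "TC" and "TT" are the G>C and G>T SNPs at position 7.
record = "chr1\t6\t.\tTG\tT,TC,TT\t.\t.\t."

n = count_alleles(record)
kind = "multiallelic" if n > 2 else "biallelic" if n == 2 else "invariant"
print(n, kind)  # 4 multiallelic
```

Symbolic alleles, such as `<NON_REF>` in GVCFs, would also be counted by this helper, so treat it as a rough check rather than a definitive classification.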
Post edited by Geraldine_VdAuwera on

Comments

  • olavur (Member)

    True multiallelic sites are not observed very frequently unless you look at very large cohorts, so they are often taken as a sign of a noisy region where artifacts are likely.

    I have a few questions on this subject:

    • Is there a reference for this? Some research?
    • Does this mean that when you have multiallelic variants in a VCF file, you can safely ignore them?
    • If your cohort is too small to support multiallelic sites, can you make sure your pipeline only produces biallelic sites, either in the alignment or in the variant calling phase?
  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @olavur
    Hi,

    Have a look at "A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data". https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/

    -Sheila

  • Hi,
    I am doing an undergrad project on the apolipoprotein gene and am wondering whether this gene is triallelic or multiallelic, as it is not really clear in the literature. I know it has the alleles E2, E3, and E4 but how would you find/interpret the other alleles on databases such as the NCBI? or does this database always assume genes to be diallelic? (It's quite complex to get around for those with little background in genetics).

  • Sheila, Broad Institute (Member, Broadie, Moderator)
    edited February 19

    @genes
    Hi,

    I hope this dictionary entry will help you.

    -Sheila

    EDIT: Sorry, I just realized I linked you to the same article you posted in. We usually refer to any site that has more than two alleles present as multiallelic. In your case, having three alleles is triallelic, but we also refer to it as multiallelic. I am not sure about "how would you find/interpret the other alleles on databases such as the NCBI? or does this database always assume genes to be diallelic?" What exactly are you trying to accomplish?
