Biallelic vs Multiallelic sites

KateN, Cambridge, MA (Member, Broadie, Moderator)
edited December 2015 in Dictionary

A biallelic site is a specific locus in a genome that contains two observed alleles, counting the reference as one, and therefore allowing for one variant allele. In practical terms, this is a site where, across multiple samples in a cohort, you have evidence for a single non-reference allele. Shown below is a toy example in which the consensus sequences for samples 1-3 have a deletion at position 7, while sample 4 matches the reference. This is considered a biallelic site because only two alleles are observed: a deletion and the reference allele G.

           1 2 3 4 5 6 7 8 9
Reference: A T A T A T G C G
Sample 1 : A T A T A T - C G
Sample 2 : A T A T A T - C G
Sample 3 : A T A T A T - C G
Sample 4 : A T A T A T G C G

A multiallelic site is a specific locus in a genome that contains three or more observed alleles, again counting the reference as one, and therefore allowing for two or more variant alleles. In practical terms, this is a site where, across multiple samples in a cohort, you see evidence for two or more non-reference alleles. Shown below is a toy example in which the consensus sequences for samples 1-3 have a deletion or a SNP at position 7, while sample 4 matches the reference. This is considered a multiallelic site because four alleles are observed: a deletion, the reference allele G, a C (SNP), and a T (SNP). True multiallelic sites are not observed very frequently unless you look at very large cohorts, so they are often taken as a sign of a noisy region where artifacts are likely.

           1 2 3 4 5 6 7 8 9
Reference: A T A T A T G C G
Sample 1 : A T A T A T - C G
Sample 2 : A T A T A T C C G
Sample 3 : A T A T A T T C G
Sample 4 : A T A T A T G C G
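
The two toy examples can be checked with a short script. This is only an illustrative sketch, not part of GATK: the function names `classify_site` and `classify_position` are hypothetical, and the sequences are the toy alignments above with the spaces removed and "-" standing for a deleted base. It simply counts the distinct alleles seen at one position, with the reference counted as one of them.

```python
def classify_site(ref_allele, sample_alleles):
    """Label one locus by the number of distinct observed alleles.

    The reference allele always counts as one observed allele,
    matching the definitions in the text above.
    """
    observed = {ref_allele} | set(sample_alleles)
    if len(observed) == 1:
        return "monomorphic"   # every sample matches the reference
    if len(observed) == 2:
        return "biallelic"     # reference plus one variant allele
    return "multiallelic"      # reference plus two or more variant alleles


def classify_position(reference, samples, position):
    """Classify a 1-based position in a set of pre-aligned toy sequences."""
    idx = position - 1
    return classify_site(reference[idx], [s[idx] for s in samples])


# Toy data copied from the two examples above ("-" marks the deletion).
reference = "ATATATGCG"
biallelic_samples = ["ATATAT-CG", "ATATAT-CG", "ATATAT-CG", "ATATATGCG"]
multiallelic_samples = ["ATATAT-CG", "ATATATCCG", "ATATATTCG", "ATATATGCG"]

print(classify_position(reference, biallelic_samples, 7))
print(classify_position(reference, multiallelic_samples, 7))
```

In a real VCF the same distinction shows up in the ALT column: a biallelic record lists a single ALT allele, while a multiallelic record lists two or more, separated by commas.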
Post edited by Geraldine_VdAuwera

Comments

  • olavur (Member)

    True multiallelic sites are not observed very frequently unless you look at very large cohorts, so they are often taken as a sign of a noisy region where artifacts are likely.

    I have a few questions on this subject:

    • Is there a reference for this? Some research?
    • Does this mean that when you have multiallelic variants in a VCF file, you can safely ignore them?
    • If your cohort is too small to support multiallelic sites, can you make sure your pipeline only produces biallelic sites, either in the alignment or in the variant calling phase?
  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @olavur
    Hi,

    Have a look at "A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data". https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/

    -Sheila