bugs in the Allele Frequencies in hapmap_3.3.hg19.vcf from the 2.3 resource bundle?

Hello,

I am trying to understand the data in the file hapmap_3.3.hg19.vcf from the 2.3 resource bundle.

I was spot-checking the data to make sure I understood how it is structured, but I'm getting information I can't reconcile. I believe I may have found some bugs. I suspect this may be caused by a change in the reference base between HG18 and HG19.

Please excuse me if I am simply misunderstanding the data.

From the 1000 Genomes documentation page on VCF 4.1:

ALT = comma separated list of alternate non-reference alleles called on at least one of the samples.

AA = ancestral allele

AC = allele count in genotypes, for each ALT allele, in the same order as listed

AF = allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes

AN = total number of alleles in called genotypes

#CHROM  POS     ID          REF ALT QUAL FILTER INFO
chrM    1189    rs28358571  T   C   .    PASS   AC=6;AF=0.011;AN=536

As I understand the definitions,

T is the reference allele, with a count of 536 - 6 = 530.

C is the alternative allele, with a count of 6.

C's frequency is 6/536 ~= 0.011
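
The arithmetic above can be reproduced with a minimal Python sketch. The helper name `parse_info` is made up for illustration, and the sketch assumes that every INFO entry has the form `KEY=value`, which holds for the lines quoted in this post but not for flag-style INFO keys in general.

```python
def parse_info(info):
    """Parse a VCF INFO string such as 'AC=6;AF=0.011;AN=536' into a dict.

    Assumes every entry is KEY=value (no bare flags).
    """
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


info = parse_info("AC=6;AF=0.011;AN=536")

an = int(info["AN"])                                   # total called alleles
alt_counts = [int(x) for x in info["AC"].split(",")]   # one count per ALT allele
ref_count = an - sum(alt_counts)                       # whatever is left is REF
alt_freqs = [count / an for count in alt_counts]

print(ref_count)               # 530
print(round(alt_freqs[0], 3))  # 0.011
```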

However, I very often see lines of data like the following

#CHROM  POS     ID          REF ALT QUAL FILTER INFO
chr1    107352  rs4124251   T   A,G .    PASS   AC=416,194;AF=0.682,0.318;AN=610

1) Total Allele frequency should always equal 1.

2) 3 Alleles exist, T,A and G.

3) AF should mark allele frequencies for the alternative alleles.

4) AF = 0.682,0.318 which means
A freq = 0.682 and G freq = 0.318

5) A freq + G freq = 1.

6) There is nothing left to indicate the frequency of the REF allele, T.

Having two alternative alleles does not always cause this problem.
For example, this one appears to be correct.

#CHROM  POS     ID          REF ALT QUAL FILTER INFO
chr1    717485  rs12184279  C   T,G .    PASS   AC=2,412;AF=2.101e-03,0.433;AN=952 
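
To make the difference between the two multi-allelic records concrete, here is a small Python sketch that derives the implied REF count and frequency from AC and AN. The function name `ref_summary` is hypothetical, and the sketch assumes AN counts every called allele, REF included, as in the definitions quoted above.

```python
def ref_summary(ac, an):
    """Return (ref_count, ref_freq) implied by a comma-separated AC string and AN."""
    alt_counts = [int(x) for x in ac.split(",")]
    ref_count = an - sum(alt_counts)
    return ref_count, ref_count / an


# rs4124251: the ALT counts add up to AN, so no REF allele was observed.
print(ref_summary("416,194", 610))   # (0, 0.0)

# rs12184279: 538 of the 952 called alleles are REF.
count, freq = ref_summary("2,412", 952)
print(count, round(freq, 3))         # 538 0.565
```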

Also, sometimes things are wrong with single alternative allele SNPs.

#CHROM  POS     ID          REF ALT QUAL FILTER INFO
chr1    546697  rs12025928  A   G   .    PASS   AC=534;AF=1.00;AN=534
chr1    565490  rs7349153   T   C   .    PASS   AC=520;AF=1.00;AN=520
chr1    565937  rs7417504   T   C   .    PASS   AC=534;AF=1.00;AN=534

These say that the alternative allele has a frequency of one, which seems wrong, as the REF allele should appear in at least some genotypes.

May I suggest

1) If the number of ALT alleles is >= 1, AC cannot equal AN.

2) If the number of ALT alleles is > 1, sum(AF) cannot equal 1.

3) If the number of ALT alleles is > 1, sum(AC) cannot equal AN.
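
A scan for the records these rules would catch might look like the Python sketch below. It only reports sites where sum(AC) equals AN, meaning the REF allele is never observed in the called genotypes; whether that is actually an error depends on where the REF base comes from. The function name `ref_unobserved` is made up, and the sketch assumes whitespace-separated columns with INFO in the eighth column and no spaces inside INFO.

```python
def ref_unobserved(vcf_lines):
    """Yield (chrom, pos, rsid) for records whose ALT counts add up to AN."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.split()
        # Keep only KEY=value entries; bare flags are ignored.
        info = dict(item.split("=", 1) for item in cols[7].split(";") if "=" in item)
        if "AC" not in info or "AN" not in info:
            continue
        alt_total = sum(int(x) for x in info["AC"].split(","))
        if alt_total == int(info["AN"]):
            yield cols[0], int(cols[1]), cols[2]


sample = [
    "#CHROM  POS     ID          REF ALT QUAL FILTER INFO",
    "chrM    1189    rs28358571  T   C   .    PASS   AC=6;AF=0.011;AN=536",
    "chr1    107352  rs4124251   T   A,G .    PASS   AC=416,194;AF=0.682,0.318;AN=610",
    "chr1    546697  rs12025928  A   G   .    PASS   AC=534;AF=1.00;AN=534",
]

flagged = list(ref_unobserved(sample))
for chrom, pos, rsid in flagged:
    print(chrom, pos, rsid)
```

To run it over the real file, the same function could be given an open file handle instead of the `sample` list.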

Thank you,

John

Answers

  • John_Pan (Member)

    Hi Eric,

    Thanks for the quick reply.

    I suppose I'm just misunderstanding the data then.

    Is the reference base coming from HG19, and not the allele with the major fraction in HapMap?

    If so, then I understand my mistake, although it seems odd that the reference base is never seen for some SNPs in the ~1000 HapMap samples.

    Thank you,

  • ebanks (Broad Institute; Member, Broadie, Dev)

    Yes, the reference allele is from hg19 (as per the VCF specification).