Bugs in the allele frequencies in hapmap_3.3.hg19.vcf from the 2.3 resource bundle?
I am trying to understand the data in the file hapmap_3.3.hg19.vcf from the 2.3 resource bundle.
I was spot-checking the data to make sure I understood how it was structured, but I'm getting information I can't reconcile. I believe I may have found some bugs. I suspect this may be caused by a change in refBase from HG18 to HG19.
Please excuse me if I am simply misunderstanding the data.
From the 1000 Genomes documentation page on VCF 4.1:
ALT = comma separated list of alternate non-reference alleles called on at least one of the samples.
AA = ancestral allele
AC = allele count in genotypes, for each ALT allele, in the same order as listed
AF = allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
AN = total number of alleles in called genotypes
#CHROM POS ID REF ALT QUAL FILTER INFO
chrM 1189 rs28358571 T C . PASS AC=6;AF=0.011;AN=536
As I understand the definitions,
T is the reference allele, with a count of 536 - 6 = 530.
C is the alternative allele, with a count of 6.
C's frequency is 6/536 ~= 0.011.
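The arithmetic above can be reproduced with a short Python sketch. It assumes the INFO column is a plain semicolon-separated list of key=value pairs, as in the record shown; parse_info is a helper name I made up for illustration, not part of any VCF library.

```python
def parse_info(info):
    """Split a VCF INFO string into AC (list), AF (list) and AN (int)."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    ac = [int(x) for x in fields["AC"].split(",")]
    af = [float(x) for x in fields["AF"].split(",")]
    an = int(fields["AN"])
    return ac, af, an

ac, af, an = parse_info("AC=6;AF=0.011;AN=536")

# REF count is whatever is left of AN after all ALT counts are removed.
ref_count = an - sum(ac)

# Frequency of each ALT allele recomputed from the counts.
alt_freqs = [count / an for count in ac]

print(ref_count)               # 530
print(round(alt_freqs[0], 3))  # 0.011
```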
However, I very often see lines of data like the following
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 107352 rs4124251 T A,G . PASS AC=416,194;AF=0.682,0.318;AN=610
1) Total Allele frequency should always equal 1.
2) Three alleles exist: T, A, and G.
3) AF should mark allele frequencies for the alternative alleles.
4) AF = 0.682,0.318 which means
A freq = 0.682 and G freq = 0.318
5) A freq + G freq = 1.
6) There is nothing left to indicate the frequency of the REF allele, T.
Having two alternative alleles does not always cause this problem.
For example, this one appears to be correct.
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 717485 rs12184279 C T,G . PASS AC=2,412;AF=2.101e-03,0.433;AN=952
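To make the difference between the two multi-allelic records concrete, here is a small Python sketch that computes the REF count implied by AC and AN. It assumes, as I do above, that AN counts every called allele including REF; implied_ref is a name I chose for illustration.

```python
def implied_ref(ac, an):
    """Return (REF count, REF frequency) implied by the ALT counts and AN."""
    ref_count = an - sum(ac)
    return ref_count, ref_count / an

# rs4124251: AC=416,194 and AN=610, so nothing is left over for REF (count 0).
print(implied_ref([416, 194], 610))

# rs12184279: AC=2,412 and AN=952, which leaves a REF count of 538.
print(implied_ref([2, 412], 952))
```

For rs4124251 the ALT counts already add up to AN, which is why the AF values sum to 1 and no frequency remains for T.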
Also, sometimes things are wrong with single alternative allele SNPs.
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 546697 rs12025928 A G . PASS AC=534;AF=1.00;AN=534
chr1 565490 rs7349153 T C . PASS AC=520;AF=1.00;AN=520
chr1 565937 rs7417504 T C . PASS AC=534;AF=1.00;AN=534
These say that the alternative allele has a frequency of one, which seems wrong, as the REF allele should appear in at least some genotypes.
May I suggest the following checks:
1) If Alt Allele count >= 1, AC cannot equal AN.
2) If Alt Allele count > 1, sum(AF) cannot equal 1.
3) If Alt Allele count > 1, sum(AC) cannot equal AN.
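A rough Python sketch of how these three checks could be scripted against the records quoted above. These are only my suggested rules, not anything from the VCF specification. The sketch assumes whitespace-separated columns in the order shown in the header line, reads rule 1 as "no single AC value may equal AN", and treats sum(AF) as equal to 1 within 0.001 because the AF values are rounded. check_record is a made-up helper name.

```python
import math

def check_record(line):
    """Return the numbers of the suggested rules that a VCF data line violates."""
    chrom, pos, rsid, ref, alt, qual, flt, info = line.split()[:8]
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    ac = [int(x) for x in fields["AC"].split(",")]
    af = [float(x) for x in fields["AF"].split(",")]
    an = int(fields["AN"])
    n_alt = len(alt.split(","))

    violations = []
    # Rule 1 (my reading): no individual ALT count may equal AN.
    if n_alt >= 1 and any(count == an for count in ac):
        violations.append(1)
    # Rule 2: with several ALT alleles, the AF values must not sum to 1.
    # The 0.001 tolerance is an assumption to absorb rounding in the file.
    if n_alt > 1 and math.isclose(sum(af), 1.0, abs_tol=1e-3):
        violations.append(2)
    # Rule 3: with several ALT alleles, the AC values must not sum to AN.
    if n_alt > 1 and sum(ac) == an:
        violations.append(3)
    return violations

records = [
    "chrM 1189 rs28358571 T C . PASS AC=6;AF=0.011;AN=536",
    "chr1 107352 rs4124251 T A,G . PASS AC=416,194;AF=0.682,0.318;AN=610",
    "chr1 717485 rs12184279 C T,G . PASS AC=2,412;AF=2.101e-03,0.433;AN=952",
    "chr1 546697 rs12025928 A G . PASS AC=534;AF=1.00;AN=534",
]
for record in records:
    print(record.split()[2], check_record(record))
```

With these four records, only rs4124251 (rules 2 and 3) and rs12025928 (rule 1) are flagged.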