Was a different reference used in the HapMap samples VCF file in Bundle 1.5?

thondeboerthondeboer Posts: 15Member
edited October 2012 in Ask the GATK team

Hi,

We are sequencing some of the HapMap samples (NA19240 for instance) and we compared the calls we get for the HapMap samples with the calls that are reported for these samples in the file "hapmap_3.3.b37.vcf" that was part of the GATK bundle 1.5.

We are surprised to find a lot of discordant calls, but when we verified the calls, we could see no evidence in the HapMap sequence data for many of them.

We see the discordance at many sites and the vast majority of those show multiple alleles for the positions (like this one:

13 32914977 rs11571660 A C,T . PASS AC=1,2785;AF=0.00036,0.99964;AN=2786;set=Intersection GT 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 2/2 ... )

We call many of these sites as Homozygous REF...

When I inspect the header I see this in the header:

##reference=Homo_sapiens_assembly18.fasta

But the file seem to suggest it was done against b37 which implies hg19... Was this file created with liftover?

Also...Could it be possible that HG19 was updated to reflect the most common allele in the population (as you can see of the MAF of 0.00036 for this example) when it went from HG18 to HG19, but that the VCF file was not updated for the HapMap 3.3 samples to reflect this?

So the bottom line is, that the HapMap3.3 file could not be used to rely on the actual calls for the HapMap samples, since about 10% of the sites show variant calls, while the HG19 reference shows no variance...

I know that HapMap is used for different purpose in GATK (Variant Score Calibration), but we may want to warn the public that you cannot use the file as a source of variation for the HapMap samples it reports on...

Thanks

Thon

Post edited by Geraldine_VdAuwera on

Best Answer

  • ebanksebanks Posts: 684GATK Developer mod
    Answer ✓

    Hi Thon,

    It looks like the raw HapMap files are bad. The docs claim that all alleles are on the forward strand, but it's not actually the case. Here's the CEU record for your example:

    rs11571660 A/G chr13 31812977 + ncbi_b36 bbs urn:lsid:bbs.hapmap.org:Protocol:Phase3.r3:1 urn:lsid:bbs.hapmap.org:Assay:Phase3.r3_rs11571660:1 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:4 QC+ TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT

    Note that the A/G alleles are on the forward strand (as confirmed by the + in the record) but all genotypes are given on the negative strand (as TT instead of AA). We need to think about the best way to proceed for such problematic cases...

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Answers

  • Mark_DePristoMark_DePristo Posts: 153Administrator, GATK Developer admin

    Hi there, yes I believe these have all been lifted over from hg18, and it likely generated these errors you are seeing. As far as I can tell the latest HapMap release 3.3 is only available on b36 so it has to be lifted over.

    http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/latest/forward/non-redundant/

    Perhaps better would be a straight remapping of the probes directly on b37.

    That said, I'm very surprised to hear that you have a lot of discordances. I'm assuming you mean that in as an absolute count, yes? In general, best you can do against even a very high quality genotype chip is like ~99.5%. If you require concordance or filter the sites down in some way I think you can achieve 99.9%, but that's mostly because you are excluding sites that have multiple alleles, that are ambiguously mapped, or that are actually incorrect genotypes (i.e., are actually indels or MNPs but only the first allele is tagged).

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • thondeboerthondeboer Posts: 15Member

    The discordance I find is more like 10%. I suspect that the HapMap consortium used a different reference...Hence in my example, almost all the samples are 2/2 which would suggest T/T hom var, but that site is clearly A/A and it is very hard to imagine that ALL hapmap samples were wrong like this...The HapMap file actually reports AA when I go straight to their site, so it seems the sequencing is correct...It is more likely that in HG18 the A is the alternative allele and T would be the reference, while in HG19, A is the reference allele and T is the second alt allele....Just the fact that it is marked as 2/2 in the VCF file is what is wrong...I'll have a look at HG18 for this position and compare with HG19...But there could also be an error in the VCF creation that you (?) do to create the HapMap 3.3 file for b37....

  • thondeboerthondeboer Posts: 15Member

    I checked HG18 and in this example it has the same reference (A) so that cannot be the issue. I further checked the rs number and interestingly the only reported variant for this position is A/G...The ONE variant that is not reported as an option for this position is G as an alternative allele. In stead, C and T are reported to be the alternative alleles...

    So i suspect some bugs in the conversion code for this position (in about 10% of the cases?)

    http://www.ncbi.nlm.nih.gov/snp/?term=rs11571660&SITE=NcbiHome&submit=Go

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,818Administrator, GATK Developer admin

    I'm filing this as a bug report, we'll take a closer look at it asap. Thanks for bringing this to our attention!

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 684GATK Developer mod
    Answer ✓

    Hi Thon,

    It looks like the raw HapMap files are bad. The docs claim that all alleles are on the forward strand, but it's not actually the case. Here's the CEU record for your example:

    rs11571660 A/G chr13 31812977 + ncbi_b36 bbs urn:lsid:bbs.hapmap.org:Protocol:Phase3.r3:1 urn:lsid:bbs.hapmap.org:Assay:Phase3.r3_rs11571660:1 urn:lsid:dcc.hapmap.org:Panel:CEPH-60-trios:4 QC+ TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT

    Note that the A/G alleles are on the forward strand (as confirmed by the + in the record) but all genotypes are given on the negative strand (as TT instead of AA). We need to think about the best way to proceed for such problematic cases...

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • thondeboerthondeboer Posts: 15Member

    Thanks for the clarification...We actually also got some original files (will post later what we got exactly) and in THAT file, this particular genotype is listed as AA in most cases...So not ALL the files are bad from HapMap...Puzzling...

  • thondeboerthondeboer Posts: 15Member

    I checked with the person who got the files for us and he got it from here:

    ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2010-08_phaseII+III/forward/

    In that file (for the YRI branch at least) it seems that the genotype was annotated correctly.... We may not be using the same file as you did? This is getting weird... Can I ask what your exact source is?

    rs11571660 A/G chr13 31812977 + ncbi_b36 perlegen urn:lsid:perlegen.hapmap.org:Protocol:Genotyping_1.0.0:2 urn:lsid:perlegen.hapmap.org:Assay:25770.4763433:1 urn:LSID:dcc.hapmap.org:Panel:Yoruba-30-trios:1 QC+ NN NN NN NN NN NN NN NN NN AA AA AA AA AA AA AA AA AA NN NN NN AA AA AA NN NN NN AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA NN NN NN AA AA AA NN NN NN NN NN NN NN NN NN NN NN NN AA AA AA NN NN NN NN NN NN NN NN NN NN NN NN AA AA AA NN NN NN AA AA AA AA AA AA NN NN NN NN NN NN NN NN NN AA NN NN AA AA NN NN NN NN AA AA AA AA AA AA NN NN AA AA AA AA AA AA AA AA AA NN NN NN NN NN NN AA AA AA NN NN AA AA AA NN NN AA AA AA NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN AA AA AA NN NN NN NN NN AA AA AA AA AA AA AA AA AA AA AA AA NN NN NN NN NN AA AA AA NN NN NN NN NN NN AA AA AA NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN

  • ebanksebanks Posts: 684GATK Developer mod

    I'm creating a new HapMap resource from the 2010-08_phaseII+III data now and will make a new version of the bundle this week in connection with the upcoming GATK 2.2 release.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Sign In or Register to comment.