Truth & Control sources - HapMap and 1000G

dilawerkh4 Member
edited August 2017 in Ask the GATK team

Hi everyone,

I apologize in advance if this question seems like a stupid one. I have always thought that sources such as HapMap and 1000G from the resource bundle that we use in VQSR comprise many global samples, but when I peeked inside the VCFs, I saw only a reference and an alternate allele, seemingly for just one sample. What am I missing here?

If the multisample genotype info is somehow incorporated into the VCF index file, is there a way to display the contents of the index file so that I can remove all African samples? They are totally irrelevant to my test sample and seem to be negatively affecting the calibration and the calls for my test sample.


Answers

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    @dilawerkh4,

    These resource files are sites-only VCFs that summarize findings for large cohorts: they list variant sites and alleles but carry no per-sample genotype columns. If you need to generate your own cohort callset, you should check out the data available from the 1000 Genomes Project as well as gnomAD. The latter is the whole-genome equivalent of the ExAC portal (which covers exomes). It's my understanding that both ExAC and gnomAD include 1000 Genomes Project samples, so you may find that these newer data portals organize data in a manner that is more accessible to you, e.g. into data frames such as GenomicsDB. I believe they also provide callsets that have been stratified by population.

    GATK's focus is more on software tools and how to use them, so please follow up with the respective data providers for more discussion.

  • Thanks for the resource, Shlee. Correct me if I misunderstood, but I think what you mean is that the VCF index file contains neither genotype information nor sample IDs for the individual samples that make up the HapMap VCF, and thus we're not able to remove samples from it.

    I may have misunderstood Geraldine, but in another thread I thought she had mentioned that we could add samples to the HapMap VCF.

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @dilawerkh4
    Hi,

    You are correct that the VCF index file does not contain genotype information or individual sample IDs.

    I suspect the mention of adding samples had to do with adding samples to your exome cohort so you can use VQSR. If you do not have at least 30 exome samples to run VQSR, you can add samples from 1000 Genomes that are similar to your samples. If you search the forum, you can find threads related to that.

    -Sheila
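
For anyone who wants to confirm for themselves that a resource file is sites-only, the check can be sketched in a few lines of Python. This is a minimal illustration, not a GATK tool: it relies only on the VCF convention that the `#CHROM` header line names eight fixed columns, optionally followed by `FORMAT` and one column per sample. The function names and the toy header text below are made up for the example.

```python
import gzip

FIXED_COLUMNS = 8  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO


def sample_names_from_header(lines):
    """Return the sample IDs named on the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\r\n").split("\t")
            # Column 9 is FORMAT; every column after it is one sample.
            # A sites-only VCF stops at INFO, so this slice is empty.
            return columns[FIXED_COLUMNS + 1:]
        if not line.startswith("#"):
            break  # reached data records without seeing a column header
    raise ValueError("no #CHROM header line found")


def is_sites_only(lines):
    """True when the VCF header declares no per-sample columns."""
    return len(sample_names_from_header(lines)) == 0


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF for reading as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


# Toy headers (invented for illustration) so the sketch runs without any files.
SITES_ONLY_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tG\tA\t.\tPASS\tAC=1\n"
)
WITH_SAMPLES_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B\n"
    "1\t1000\t.\tG\tA\t.\tPASS\tAC=1\tGT\t0/1\t0/0\n"
)

if __name__ == "__main__":
    print(is_sites_only(SITES_ONLY_TEXT.splitlines()))
    print(sample_names_from_header(WITH_SAMPLES_TEXT.splitlines()))
```

To run the same check on a real file, pass the open handle instead of the toy text, e.g. `with open_vcf(path) as handle: print(sample_names_from_header(handle))`, where `path` points at your own copy of the resource VCF. An empty list means there are no sample columns to subset or remove, which matches the answers above.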
