
Genotype Concordance HG19 file? Which to pick?

JonR (Indiana; Member)

I'm looking for a standard high-confidence VCF file against which to compare several dozen NA12891 samples for genotype concordance. We are testing different protocols and would like to determine which is most accurate and sensitive at calling variants. I see the resource bundle here:

ftp://ftp.broadinstitute.org/bundle/hg19/

But there are lots of options and I'm not sure what to pick. Any recommendations? Also, I've grepped some of these files and can't find NA12891 in any of them.

This file does, however:

CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf.gz

Should I go with that? Use a different file? Mix and match?
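For the "which files contain NA12891" check, grepping the whole file can be slow and can miss or over-match. In a VCF, the sample names appear as the columns after FORMAT on the `#CHROM` header line, so reading only the header is enough. The following is a minimal sketch in Python; the function name `vcf_sample_names` is hypothetical, and it assumes a tab-delimited VCF that is either plain text or gzip/bgzip-compressed.

```python
import gzip


def vcf_sample_names(path):
    """Return the sample names listed on the #CHROM header line of a VCF.

    Assumes a tab-delimited VCF; a ".gz" suffix is taken to mean
    gzip/bgzip compression. Sites-only VCFs (no FORMAT or sample
    columns) yield an empty list.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                fields = line.rstrip("\n").split("\t")
                # The first nine columns (CHROM through FORMAT) are fixed;
                # anything after them is a sample name.
                return fields[9:]
            if not line.startswith("#"):
                break  # reached the records without seeing a header line
    return []
```

Usage would look like `"NA12891" in vcf_sample_names("CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf.gz")`, run from the directory holding the downloaded file.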

Answers

  • shlee (Cambridge; Member, Broadie)

    Hi @JonR,

    NA12891 is the father of NA12878, the sample that folks typically use to standardize variant calling. There is a mountain of NA12878 (and other trio) resources at the NIST (National Institute of Standards and Technology) GIAB (Genome in a Bottle) website here. The term high-confidence typically refers to a set of genomic intervals within which we have high confidence in the calls. These intervals are also provided by GIAB and can vary depending on the version you download.

    To get an introduction, check out the GATK tutorial Filtering and Evaluation available in this folder.

    If you are set on using NA12891, then perhaps check out Illumina Platinum Genomes and the 1000 Genomes Project for callsets.

    Good luck.
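To make the role of the high-confidence intervals concrete, here is a rough sketch of restricting a genotype comparison to those intervals. It is a simplified illustration, not the method used by any particular concordance tool. All function names are hypothetical. It assumes the intervals come from a BED file (0-based, half-open) with non-overlapping spans, that VCF positions are 1-based, and that genotypes have already been parsed into comparable values such as sorted allele-index tuples.

```python
from bisect import bisect_right


def build_index(bed_rows):
    """Group (chrom, start, end) BED rows by chromosome, sorted by start."""
    index = {}
    for chrom, start, end in bed_rows:
        index.setdefault(chrom, []).append((start, end))
    for spans in index.values():
        spans.sort()
    return index


def in_confident_region(index, chrom, pos):
    """True if 1-based VCF position `pos` falls in a 0-based, half-open span.

    Assumes the spans on each chromosome do not overlap.
    """
    spans = index.get(chrom, [])
    # Locate the last span whose start is <= the 0-based position.
    i = bisect_right(spans, (pos - 1, float("inf"))) - 1
    return i >= 0 and spans[i][0] <= pos - 1 < spans[i][1]


def genotype_concordance(truth, calls, index):
    """Fraction of in-region truth sites whose called genotype matches.

    `truth` and `calls` map (chrom, pos, ref, alt) to a genotype value.
    Truth sites missing from `calls` count as mismatches.
    """
    compared = matched = 0
    for site, truth_gt in truth.items():
        chrom, pos = site[0], site[1]
        if not in_confident_region(index, chrom, pos):
            continue  # outside the high-confidence intervals: not evaluated
        compared += 1
        if calls.get(site) == truth_gt:
            matched += 1
    return matched / compared if compared else float("nan")
```

This only counts exact site-and-genotype matches, so variants represented differently in the two files (for example, differently normalized indels) would be scored as mismatches; a dedicated concordance tool handles such cases more carefully.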
