Expected number of variants in WES data

Hi,
I'd like to know the approximate number of high-quality variants expected in WES data. According to this page on the GATK site (https://software.broadinstitute.org/gatk/documentation/article?id=6308), it is about 41k. However, I observe only about 30-32k variants in exome data for CDS regions (Gencode v19), and I saw similar numbers when I analyzed the 1000 Genomes phase 3 calls. I'm not sure why my numbers are lower. Any information about the source of those numbers, the regions used, and the ranges usually observed would be helpful.

Thanks
Uma

Answers

  • bshifaw (Member, Broadie, Moderator, admin)

    The reason the numbers vary among samples is explained in the paragraph below the table:

    Number of Indels & SNPs: The number of variants detected in your sample(s) is counted separately as indels (insertions and deletions) and SNPs (Single Nucleotide Polymorphisms). Many factors can affect this statistic, including whole exome (WES) versus whole genome (WGS) data, cohort size, strictness of filtering through the GATK pipeline, the ethnicity of your sample(s), and even algorithm improvements due to a software update. For reference, see Nature's recently published 2015 paper, in which various ethnicities in a moderately large cohort were analyzed for number of variants. As such, this metric alone is insufficient to confirm data validity, but it can raise warning flags when something went extremely wrong: e.g. 1000 variants in a large cohort WGS data set, or 4 billion variants in a ten-sample whole-exome set.

    From the team:
    "The number of variants you get for exome datasets can easily vary based on what population you're examining, or what exome capture kit you used. Thus, 30k is a perfectly reasonable number."

    As for where that number comes from, I believe it references the paper cited in that paragraph, "A global reference for human genetic variation".
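When comparing your own counts against a published figure, it can help to make the counting rules explicit, since the totals depend on the regions, the filter status, and whether SNPs and indels are tallied separately. The following is a minimal sketch in plain Python, not a GATK tool: the function names `classify` and `count_variants`, the tiny in-memory VCF, and the region list are all hypothetical and only illustrate one possible convention (counting each ALT allele once, keeping only records whose FILTER is PASS or ".", and treating regions as BED-style 0-based half-open intervals).

```python
def classify(ref, alt):
    """Classify one REF/ALT pair as 'snp', 'indel', or 'other' (assumed convention)."""
    if alt in (".", "*") or alt.startswith("<"):
        return "other"  # missing, spanning-deletion, or symbolic allele
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # same-length multi-base change (e.g. MNP)


def count_variants(vcf_lines, regions=None, pass_only=True):
    """Count ALT alleles by type in VCF text lines.

    regions: optional list of (chrom, start, end) tuples, BED-style
             (0-based start, exclusive end). VCF POS is 1-based, so a
             record is inside a region when start < POS <= end.
    """
    counts = {"snp": 0, "indel": 0, "other": 0}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        ref, alts, flt = fields[3], fields[4], fields[6]
        if pass_only and flt not in ("PASS", "."):
            continue  # drop records that failed a filter
        if regions is not None and not any(
            c == chrom and s < pos <= e for c, s, e in regions
        ):
            continue  # outside the target regions
        for alt in alts.split(","):
            counts[classify(ref, alt)] += 1
    return counts


# Hypothetical four-record VCF, for illustration only.
example = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t150\t.\tAT\tA\t50\tPASS\t.",
    "1\t300\t.\tC\tT\t50\tPASS\t.",
    "1\t120\t.\tG\tC\t10\tLowQual\t.",
])
regions = [("1", 90, 200)]

in_regions = count_variants(example.splitlines(), regions)
everywhere = count_variants(example.splitlines())
print(in_regions, everywhere)
```

Changing the region list or the `pass_only` flag changes the totals, which is one reason two analyses of similar exome data can report different numbers.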