Expected number of variants in WES data

I'd like to know the approximate number of high-quality variants expected in WES data. According to this page on the GATK site (https://software.broadinstitute.org/gatk/documentation/article?id=6308), it is about 41k. However, I observe only about 30-32k variants in exome data for CDS regions (Gencode v19), and I see similar numbers when I analyze the 1000 Genomes phase 3 calls. I'm not sure why my numbers are lower. Any information about the reference for those numbers, the region that was used, and the ranges usually observed would be helpful.




  • bshifaw (Member, Broadie, Moderator, admin)

    The reason the numbers vary among samples is given in the paragraph below the table:

    Number of Indels & SNPs: The number of variants detected in your sample(s) is counted separately as indels (insertions and deletions) and SNPs (single nucleotide polymorphisms). Many factors can affect this statistic, including whole exome (WES) versus whole genome (WGS) data, cohort size, strictness of filtering through the GATK pipeline, the ethnicity of your sample(s), and even algorithm improvements due to a software update. For reference, see the 2015 Nature paper in which various ethnicities in a moderately large cohort were analyzed for number of variants. As such, this metric alone is insufficient to confirm data validity, but it can raise warning flags when something went extremely wrong: e.g. 1000 variants in a large cohort WGS data set, or 4 billion variants in a ten-sample whole-exome set.

    Team -
    "The number of variants you get for exome datasets can easily vary based on what population you're examining, or what exome capture kit you used. Thus, 30k is a perfectly reasonable number."

    As for where that number comes from, I believe it refers to the paper cited in that paragraph, "A global reference for human genetic variation".
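
Since the totals depend on which region is counted and how strictly calls are filtered, it can help to make both choices explicit when comparing against a published figure. The following is only a minimal sketch, not a GATK tool: the function names, the toy records, and the `cds` interval are all hypothetical, and it assumes an uncompressed VCF read as text lines, BED-style 0-based half-open intervals, and counting per ALT allele rather than per site. It tallies SNPs and indels separately, keeping only records whose FILTER is `PASS` or `.`.

```python
from collections import Counter

def classify(ref, alt):
    """Return 'SNP', 'INDEL', or 'OTHER' for one REF/ALT allele pair."""
    if alt.startswith("<") or alt == "*" or "[" in alt or "]" in alt:
        return "OTHER"  # symbolic, spanning-deletion, or breakend alleles
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return "OTHER"  # e.g. multi-nucleotide substitutions

def in_intervals(chrom, pos, intervals):
    """VCF positions are 1-based; intervals are BED-style (0-based, half-open)."""
    return any(c == chrom and start < pos <= end for c, start, end in intervals)

def count_variants(vcf_lines, intervals=None, pass_only=True):
    """Count ALT alleles by type, optionally restricted to the given intervals."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        ref, alts, filt = fields[3], fields[4], fields[6]
        if pass_only and filt not in ("PASS", "."):
            continue  # drop calls that failed a filter
        if intervals is not None and not in_intervals(chrom, pos, intervals):
            continue  # drop calls outside the region of interest
        for alt in alts.split(","):
            counts[classify(ref, alt)] += 1
    return counts

# Hypothetical toy records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",     # SNP inside the interval
    "1\t150\t.\tAT\tA\t50\tPASS\t.",    # deletion inside the interval
    "1\t180\t.\tC\tT\t10\tLowQual\t.",  # SNP that failed a filter
    "1\t190\t.\tC\tCA,T\t50\tPASS\t.",  # multi-allelic: one insertion, one SNP
    "1\t500\t.\tG\tC\t50\tPASS\t.",     # SNP outside the interval
]
cds = [("1", 90, 200)]  # hypothetical stand-in for a CDS interval list

print(dict(count_variants(example_vcf, cds)))
```

Running the same count with and without the interval restriction, or with `pass_only=False`, shows how much of a gap such as 41k versus 30k could come from region and filtering choices alone. For real data, an established tool would normally be preferable to this linear interval scan.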
