How to evaluate a call set when no true SNP set is available?

Juls Member ✭✭

Hi,

I would like to know how to evaluate a call set when no true SNP set is available. Are there recommendations? Certain values one can look at? VariantEval seems to focus on comparing with a reference of known variants. But maybe there are at least a couple of basic values that one can evaluate?

Also, is there a ballpark percentage of SNPs that is usually filtered by the hard filtering recommendations? I am aware that you cannot give me a precise number here, but I would like to know if I should expect 10-30% to get kicked out or more like 60-80%.

Thanks so much!!

Answers

  • bshifaw Member, Broadie, Moderator admin

    Hi @Juls,

    If you cannot find training or truth sets, then you'll need to create your own, as mentioned in the following document.

    There is no sure answer to the percentage of SNPs being filtered out. It depends heavily on the quality of the call set and whether the researcher is willing to lose true positives to avoid false positives, or maximise the true positives with the risk of having false positives.

    Filtering is about balancing sensitivity and precision for research aims. For example, genome-wide association studies can afford to maximize sensitivity over precision such that there are more false positives in the callset. Conversely, downstream analyses that require high precision, e.g. those that cannot tolerate false positive calls because validating variants is expensive, maximize precision over sensitivity such that the callset loses true positives.
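
    To make the trade-off concrete, here is a minimal Python sketch of the two metrics. The function name and all of the counts are invented for illustration; real counts would only be available once you have some truth or validation set to compare against.

```python
def precision_sensitivity(tp, fp, fn):
    """Return (precision, sensitivity) from counts of true positives,
    false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, sensitivity


# Hypothetical counts for the same callset under two filtering strategies.
lenient = precision_sensitivity(tp=95, fp=20, fn=5)  # keeps more calls
strict = precision_sensitivity(tp=80, fp=2, fn=20)   # removes more calls

print("lenient: precision=%.3f sensitivity=%.3f" % lenient)
print("strict:  precision=%.3f sensitivity=%.3f" % strict)
```

    With these made-up numbers, the lenient filter retains more true variants at the cost of more false positives, while the strict filter does the reverse.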

    View the following tutorial for instructions on hard filtering: (How to) Filter variants either with VQSR or by hard-filtering
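
    As a rough way to see what fraction of your own SNPs a set of hard filters would remove, something like the following Python sketch could be used. The thresholds are the commonly cited generic starting points for SNPs and should be checked against the tutorial above; the function names and the toy records are hypothetical, and treating a missing annotation as "not failing" is an assumption of this sketch.

```python
# Assumed generic SNP thresholds; verify against the hard-filtering tutorial.
# Each entry maps an annotation name to a test that is True when the site fails.
SNP_FILTERS = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "SOR": lambda v: v > 3.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def failed_filters(annotations):
    """Return the names of the filters that a single variant fails.
    Annotations absent from the record are skipped (sketch assumption)."""
    return [
        name
        for name, fails in SNP_FILTERS.items()
        if name in annotations and fails(annotations[name])
    ]


def fraction_filtered(variants):
    """Fraction of variants failing at least one filter."""
    if not variants:
        return 0.0
    n_failed = sum(1 for v in variants if failed_filters(v))
    return n_failed / len(variants)


# Toy records with made-up annotation values.
toy_variants = [
    {"QD": 25.0, "FS": 1.2, "MQ": 60.0},
    {"QD": 1.5, "FS": 70.0, "MQ": 60.0},
    {"QD": 12.0, "FS": 3.0, "MQ": 35.0},
]

for record in toy_variants:
    print(record, "->", failed_filters(record) or "PASS")
print("fraction filtered: %.2f" % fraction_filtered(toy_variants))
```

    The fraction you get this way describes only your own callset; as noted above, it will vary with data quality and with how strict you choose to be.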
