# Algorithm question for VQSR

As far as I understand, VQSR selects a pool of SNPs present in both the call set being tested and a known, annotated SNP database. These SNPs are considered true variants, and a Gaussian mixture model is fit to the annotations of these true variants in order to classify the remaining SNPs.

These true SNPs will be clustered using a Gaussian model. However, a Gaussian mixture model means we also cluster "bad" SNPs. I imagine these "bad" SNPs have poor quality in different ways, so the Gaussian mixture model will end up with multiple clusters (one true-SNP cluster and several bad-SNP clusters), right?

Then why can't we just use a single Gaussian model to fit the distribution of the true SNPs, so that any SNP far from that cluster is more likely to be false?

## Answers

Can anyone share some comments? Much appreciated.

@UniCorn

Hi,

I am not sure I understand. Can you clarify? Perhaps the presentation here on VQSR will help.

-Sheila

Hi Sheila

Thanks for answering. I imagine that false SNP calls are bad in different ways, so they hardly form a coherent cluster. Why can't we just use a simple Gaussian model to fit the real SNP calls only (instead of using a Gaussian mixture to model both real SNPs and a variety of false-positive SNPs)? Any SNP calls far from that cluster would then be false positives.

The basic idea is that we build a positive model from variants in the VCF that are also in a known resource of common variants. This (Gaussian mixture) model outputs a probability for each variant. We can think of that probability as measuring how much the variant looks like the known common variants. If it is 0.999, the model is very confident the variant is similar to the known common variants; if it is 0.00001, the model thinks the variant looks nothing like them. We then take the variants in the VCF with very low probabilities and fit a negative model to those. The VQSR score for each variant is the log odds: the probability under the positive model over the probability under the negative model.
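A minimal numeric sketch of this positive/negative scoring idea, with several simplifying assumptions: one annotation dimension, a single Gaussian per model instead of the multi-dimensional mixtures VQSR actually fits, and made-up annotation values.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    # Log density of a one-dimensional normal distribution.
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def fit_gaussian(values):
    # Maximum-likelihood mean and standard deviation.
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu)**2 for v in values) / n
    return mu, math.sqrt(var)

# Toy annotation values (imagine a single quality annotation per variant).
known_sites = [30.1, 29.5, 31.2, 30.8, 29.9, 30.4]   # calls also found in the truth resource
all_calls   = [30.2, 30.6, 12.1, 29.8, 11.5, 30.9, 13.0]

# 1. Positive model: fit to the variants that overlap the known resource.
mu_pos, sd_pos = fit_gaussian(known_sites)

# 2. Negative model: fit to the calls the positive model scores worst.
ranked = sorted(all_calls, key=lambda x: gaussian_logpdf(x, mu_pos, sd_pos))
mu_neg, sd_neg = fit_gaussian(ranked[:3])            # three lowest-probability calls

# 3. Score = log odds of the positive model over the negative model,
#    analogous in spirit to the VQSLOD score.
def score(x):
    return gaussian_logpdf(x, mu_pos, sd_pos) - gaussian_logpdf(x, mu_neg, sd_neg)

for x in all_calls:
    print(f"annotation={x:5.1f}  score={score(x):9.2f}")
```

Calls that resemble the truth set get a large positive score, while calls that resemble the low-probability pool get a negative score, which is the intuition behind thresholding on the log-odds ratio.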

@UniCorn The motivation behind building a model for the negative variants as well as the positive ones is to try to avoid penalizing rare variants. Rare variants are not in the truth set of common variants and may look somewhat different from them in annotation space. If they are real variants they should also look different from the false positives that make up most of the negative model and so the two models will hopefully cancel each other out.