Algorithm question for VQSR
As far as I understand, VQSR selects the pool of SNPs that exist in both the test set and a database of known, annotated SNPs. These SNPs are treated as true variants, and a Gaussian mixture model is built from the features of these true variants to classify the remaining SNPs.
These true SNPs are clustered using the Gaussian model. However, using a Gaussian mixture model implies that we are clustering the "bad" SNPs as well. I imagine that these "bad" SNPs are poor in different ways along different feature dimensions, so the final Gaussian mixture model will contain multiple clusters (one true-SNP cluster and several bad-SNP clusters), right?
Then why can't we just use a simple (single) Gaussian model to describe the distribution of true SNPs, and treat any SNP far from this cluster as more likely to be false?
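To make the question concrete, here is a minimal sketch comparing the two options on synthetic data. It is not how GATK implements VQSR. It assumes a single made-up annotation whose values for "true" SNPs fall into two groups, and all function and variable names are hypothetical. It fits a single Gaussian and a two-component mixture (with a small hand-written EM loop) to the same values, then compares the density each model assigns to a point in the empty gap between the groups and to a point inside one group.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_single_gaussian(xs):
    """Maximum-likelihood mean and variance of one Gaussian."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var


def fit_two_component_gmm(xs, iters=50):
    """Fit a 2-component 1-D Gaussian mixture with plain EM."""
    _, var0 = fit_single_gaussian(xs)
    means = [min(xs), max(xs)]   # crude but deterministic initialisation
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)  # guard against a collapsing component
            weights[k] = nk / len(xs)
    return weights, means, variances


def gmm_pdf(x, weights, means, variances):
    """Density of the fitted mixture at x."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Synthetic "true SNP" annotation values: two groups around -3 and +3 (assumption).
random.seed(0)
true_snps = [random.gauss(-3.0, 0.5) for _ in range(500)]
true_snps += [random.gauss(3.0, 0.5) for _ in range(500)]

single_mean, single_var = fit_single_gaussian(true_snps)
weights, means, variances = fit_two_component_gmm(true_snps)

gap_point, mode_point = 0.0, 3.0  # 0.0 lies between the groups, 3.0 inside one

single_at_gap = normal_pdf(gap_point, single_mean, single_var)
single_at_mode = normal_pdf(mode_point, single_mean, single_var)
mix_at_gap = gmm_pdf(gap_point, weights, means, variances)
mix_at_mode = gmm_pdf(mode_point, weights, means, variances)

print("single Gaussian: gap vs group density", single_at_gap, single_at_mode)
print("mixture:         gap vs group density", mix_at_gap, mix_at_mode)
```

In this toy setup the single Gaussian is centred in the gap, so it scores a point where no true SNP was observed higher than a point inside a real group, while the mixture does the opposite. Whether real VQSR annotations for true variants are multi-modal in this way is the assumption the sketch rests on.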