Poor VQSR filtering
I ran VQSR on the VCF file from joint genotyping, using dbSNP as the training set. The plots generated during VQSR don't seem to separate the positive and negative training data very well. Below are the plots for one sample.
I used --ts_filter_level 99.0 during recalibration. Here is an example of the resulting tranche filter lines in the VCF header:
##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -39616.7976">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -39616.7976 <= x < -6.9367">
Of the 26 million SNPs, only 32,000 are filtered out by VQSR, so I am not sure if this is working.
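For scale, 32,000 out of 26 million is roughly 0.12% of the SNPs. A minimal Python sketch of how such a count could be reproduced from the FILTER column is shown here; the function names `count_filters` and `fraction_filtered` are made up for illustration, and the code assumes a plain-text, tab-delimited VCF in which unfiltered records carry either `PASS` or `.`.

```python
from collections import Counter


def count_filters(vcf_lines):
    """Tally the FILTER column (7th column) over all VCF data lines."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed record, ignored in this sketch
        counts[fields[6]] += 1
    return counts


def fraction_filtered(counts):
    """Fraction of records whose FILTER is neither PASS nor '.' (assumption:
    '.' means no filter was applied, so it is not counted as filtered)."""
    total = sum(counts.values())
    kept = counts.get("PASS", 0) + counts.get(".", 0)
    return (total - kept) / total if total else 0.0


# The figures quoted in the post: 32,000 filtered out of 26 million SNPs.
print(f"{32_000 / 26_000_000:.2%}")
```

With a real file, `count_filters(open(path))` would also show how the filtered records split across the individual tranche filters (the path is whatever file ApplyRecalibration wrote; a gzipped VCF would need `gzip.open(path, "rt")` instead).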
I would appreciate an expert opinion on these plots. Are the VQSLOD scores usable?
To get an idea of the distribution of VQSLOD values, I plotted histograms of around 10,000 scores sampled from the first 1 million variants in the VCF file, separately for SNPs and INDELs.
It looks like there are three peaks. Any ideas on that? Could that be used for filtering?
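The sampling behind those histograms can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact script used for the plots: `sample_vqslod` and `is_snp` are hypothetical helper names, the VCF is read as plain tab-delimited text, and a record counts as a SNP only when REF and every ALT allele are a single character (so mixed sites land in the INDEL group, and a `*` allele is treated like a base).

```python
import random
import re

# VQSLOD is stored as a key=value pair in the INFO column (8th column).
VQSLOD_RE = re.compile(r"(?:^|;)VQSLOD=([^;]+)")


def is_snp(ref, alt):
    """Simplified test: REF and all ALT alleles are one character long."""
    return len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))


def sample_vqslod(vcf_lines, max_records=1_000_000, sample_size=10_000, seed=0):
    """Collect VQSLOD from the first max_records data lines, split by variant
    type, then draw up to sample_size scores at random from each group."""
    rng = random.Random(seed)
    scores = {"SNP": [], "INDEL": []}
    seen = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        if seen >= max_records:
            break
        seen += 1
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed record
        match = VQSLOD_RE.search(fields[7])
        if match is None:
            continue  # record has no VQSLOD annotation
        try:
            score = float(match.group(1))
        except ValueError:
            continue
        kind = "SNP" if is_snp(fields[3], fields[4]) else "INDEL"
        scores[kind].append(score)
    return {k: rng.sample(v, min(sample_size, len(v))) for k, v in scores.items()}
```

The two returned lists can then be passed to any histogram routine, for example matplotlib's `plt.hist(values, bins=100)`, once for SNPs and once for INDELs.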
Also, I am working on zebrafish, not human.