Poor VQSR filtering

rmf
edited January 11

I ran VQSR on my VCF file from joint genotyping, using dbSNP as the training resource. The plots generated during VQSR don't seem to separate the positive and negative training variants very well. Below are the plots for one sample.

I used --ts_filter_level 99.0 during recalibration. Here is an example of the resulting tranche filters and their VQSLOD cutoffs, as written to the VCF header:

##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -39616.7976">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -39616.7976 <= x < -6.9367">
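
To make the cutoffs in these header lines easier to inspect, here is a minimal Python sketch that pulls the VQSLOD bounds out of each tranche FILTER line. The regular expression is written against the two Description formats shown above and is an assumption; other GATK versions may word the description differently. The function name parse_tranche_bounds is hypothetical.

```python
import re

# Assumed pattern, based only on the two header lines quoted in this post:
#   "... at VQS Lod < HI"   and   "... at VQS Lod: LO <= x < HI"
TRANCHE_RE = re.compile(
    r'##FILTER=<ID=(?P<id>VQSRTranche[^,]+),Description="[^"]*VQS Lod'
    r'(?:: (?P<lo>-?[\d.]+) <= x)? < (?P<hi>-?[\d.]+)">'
)

def parse_tranche_bounds(header_lines):
    """Return {filter_id: (lower_or_None, upper)} for VQSR tranche FILTER lines."""
    bounds = {}
    for line in header_lines:
        m = TRANCHE_RE.match(line.strip())
        if m:
            lo = float(m.group("lo")) if m.group("lo") is not None else None
            bounds[m.group("id")] = (lo, float(m.group("hi")))
    return bounds

# The two header lines from this post.
header = [
    '##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity '
    'tranche level for SNP model at VQS Lod < -39616.7976">',
    '##FILTER=<ID=VQSRTrancheSNP99.90to100.00,Description="Truth sensitivity '
    'tranche level for SNP model at VQS Lod: -39616.7976 <= x < -6.9367">',
]
for filter_id, (lo, hi) in parse_tranche_bounds(header).items():
    print(filter_id, lo, hi)
```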

Of the 26 million SNPs, only 32,000 are filtered out by VQSR, so I am not sure if this is working.
I would appreciate an expert opinion on these plots. Are the VQSLOD scores usable?
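
One quick way to check how many records VQSR actually flagged is to tally the FILTER column of the recalibrated VCF. The following is a minimal sketch in plain Python, assuming an uncompressed, tab-delimited VCF read line by line. The function name count_filters and the example records are made up for illustration.

```python
from collections import Counter

def count_filters(vcf_lines):
    """Tally the FILTER column (7th field) over all non-header VCF records."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1
    return counts

# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\tVQSLOD=2.5",
    "1\t200\t.\tC\tT\t30\tVQSRTrancheSNP99.90to100.00\tVQSLOD=-7.1",
    "1\t300\t.\tG\tA\t60\tPASS\tVQSLOD=3.0",
]
for name, n in count_filters(example).most_common():
    print(name, n)
```

For a real file, the same function can be fed an open file handle, since it only iterates over lines.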

To get an idea of the distribution of VQSLOD values, I plotted a histogram of around 10,000 scores sampled from the first 1 million variants in the VCF file, shown separately for SNPs and INDELs.
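
A histogram like this can be produced roughly as follows. This is only a sketch, not the exact procedure used for the plots in this post: it assumes an uncompressed VCF with VQSLOD in the INFO column, uses reservoir sampling (my choice here, so the whole file need not be held in memory), and does not split SNPs from INDELs. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def vqslod_values(vcf_lines):
    """Yield VQSLOD as a float for each record whose INFO field carries it."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed or truncated record
        for entry in fields[7].split(";"):
            if entry.startswith("VQSLOD="):
                yield float(entry[len("VQSLOD="):])
                break

def reservoir_sample(values, k, seed=0):
    """Uniformly sample up to k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, v in enumerate(values):
        if i < k:
            sample.append(v)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = v
    return sample

def histogram(values, bin_width=1.0):
    """Count finite values per bin; keys are the lower edge of each bin."""
    counts = Counter(
        math.floor(v / bin_width) * bin_width
        for v in values
        if math.isfinite(v)
    )
    return dict(sorted(counts.items()))

# Made-up records, for illustration only.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\tAC=2;VQSLOD=2.5;culprit=MQ",
    "1\t200\t.\tC\tT\t30\tPASS\tVQSLOD=-6.94",
    "1\t300\t.\tG\tA\t60\tPASS\tAC=1;VQSLOD=2.9",
    "1\t400\t.\tT\tC\t10\tPASS\tAC=1",
]
sample = reservoir_sample(vqslod_values(records), k=10000)
print(histogram(sample, bin_width=1.0))
```

The binned counts can then be passed to any plotting tool. A narrower bin_width makes it easier to see whether the peaks are distinct.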



It looks like there are three peaks. Any ideas on that? Could that be used for filtering?

Also, I am working on zebrafish, not human.


  rmf

    I did some independent, lenient hard filtering and compared the results with the respective VQSLOD scores. The three VQSLOD peaks do, in a way, represent low-quality, mixed-quality, and high-quality variants.

  Sheila Broad Institute


    Is dbSNP the only resource available for zebrafish? A single resource may not be enough for VQSR.

    For hard filtering, have a look at this document. Perhaps plotting the actual annotations will help more than the VQSLOD scores.


