Best Practice pipeline for WES, including VQSR - what is best?
I am currently processing ~100 exomes and following the Best Practice recommendations for Pre-processing and Variant Discovery. However, there are a couple of gaps in the documentation, as far as I can tell, regarding exactly how to proceed with VQSR with exome data. I would be grateful for some feedback, particularly regarding VQSR. The issues are similar to those discussed on this thread: http://gatkforums.broadinstitute.org/discussion/4798/vqsr-using-capture-and-padding but my questions aren't fully-addressed there (or elsewhere on the Forum as far as I can see).
1) All samples processed with same protocol (~60Mb capture kit) - coverage ~50X-100X
2) Alignment with BWA-MEM (to whole genome)
3) Remove duplicates, indel-realignment, bqsr
4) HC to produce gVCFs (-ERC)
5) Genotype gVCFs
This week I have been investigating VQSR, which has generated some questions.
Q1) Which regions should I use from my data for building the VQSR model?
Here I have tried 3 different input datasets:
a) All my variant positions (11Million positions)
b) Variant positions that are in the capture kit (~326k positions) - i.e. used bedtools intersect to only extract variants from (1)
c) Variant positions that are in the capture kit with padding of 100nt either side (~568k positions) - as above but bed has +/-100 on regions + uniq to remove duplicate variants that are now in more than one bed region
For each of the above, I have produced "sensitive" and "specific" datasets:
"Specific" --ts_filter_level 90.0 \ for both SNPs and INDELs
"Sensitive" --ts_filter_level 99.5 \ for SNPs, and --ts_filter_level 99.0 \ for INDELs (as suggested in the definitive FAQ https://www.broadinstitute.org/gatk/guide/article?id=1259 )
I also wanted to see what effect, if any, the "-tranche" argument has - i.e. does it just allow for ease of filtering, or does it affect the mother generated, since it was not clear to me. I applied either 5 tranches or 6:
5-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0 \ for both SNPs and INDELs
6-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 \ for both SNPs and INDELs
To compare the results I then used bed intersect to get back to the variants that are within the capture kit (~326k, as before). The output is shown in the spreadsheet image below.
What the table appears to show me, is that at the "sensitive" settings (orange background), the results are largely the same - the difference between "PASS" in the set at the bottom where all variants were being used, and the others is mostly accounted for by variants being pushed into the 99.9-100 tranche.
However, when trying to be specific (blue background), the difference between using all variants, or just the capture region/capture+100 is marked. Also surprising (at least for me) is the huge difference in "PASS" in cells E15 and E16, where the only difference was the number of tranches given to the model (note that there is very little difference in the analogous cells in Rows 5/6 andRows 10/11.
Q2) Can somebody explain why there is such a difference in "PASS" rows between All-SPEC and the Capture(s)-Spec
Q3) Can somebody explain why 6 tranches resulted in ~23k more PASSes than 5 tranches for the All-SPEC
Q4) What does "PASS" mean in this context - a score =100? Is it an observation of a variant position in my data that has been observed in the "truth" set? It isn't actually described in the header of the VCF, though presumably the following corresponds:
FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -38682.8777">
Q5) Similarly, why do no variants fall below my lower tranche threshold of 90? Is it because they are all reliable at least to this level?
Q6) Am I just really confused? :-(
Thanks in advance for your help! :-)