Interpreting VQSLOD and Tranche Quality in a Non-Human Model Organism
Hello there! Thanks as always for the lovely tools, I continue to live in them.
I've been wondering how best to interpret my VQSLOD plots/tranches and the resulting VQSLOD scores.
Attached are those plots and a histogram of my VQSLOD scores as they are distributed across my replicate samples.
Methods Thus Far
We have HiSeq reads of "mutant" and wild-type (wt) fish, three replicates of each. The sequences were captured by size-selected digest, so some regions have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (Ti/Tv independent).
As per my interpretation of the Best Practices, I did an initial round of variant calling (HaplotypeCaller) and filtered the calls very heavily, keeping only those that could be replicated across all samples. Then I reprocessed and called variants again, using that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian prior for each resource from the values suggested for human data. The rest of my pipeline follows the Best Practices fairly closely. The GATK version was 2.7-2, and mapping was done with BWA MEM.
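To make the "replicated across all samples" step concrete, here is a minimal Python sketch of that kind of filter. It is not my exact script: the function names are hypothetical, and it assumes a multi-sample VCF where GT is the first FORMAT subfield and "replicated" means every sample carries at least one non-reference allele.

```python
def called_non_ref(sample_field):
    """True if the sample's GT (assumed to be the first FORMAT subfield)
    contains at least one non-reference, non-missing allele."""
    gt = sample_field.split(":", 1)[0]
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def keep_replicated(vcf_lines):
    """Yield header lines unchanged, plus only those records in which
    every sample column carries a variant allele."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        samples = fields[9:]  # columns 10+ hold the per-sample genotypes
        if samples and all(called_non_ref(s) for s in samples):
            yield line
```

Passing the lines of the first-pass VCF through `keep_replicated` and writing out the result would give a candidate truth set under those assumptions.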
My Semi-Educated Guess
The spike in VQSLOD I see for variants found across all six replicates is simply the rediscovery of those in my truth set, plus those with amazing coverage, which is probably fine/good. The part that worries me is the plots and tranches. The plots never really show a region where the "known" set clusters with one group of obviously good variants but not with another. Is that OK, or do that and my inflated VQSLOD values smack of poor practice?
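One way to check the "rediscovery" guess is to split the recalibrated VQSLOD scores by whether each site is in the truth set. The following is only a sketch under stated assumptions: the helper names are hypothetical, sites are matched on exact CHROM/POS/REF/ALT, and the score is read from the `VQSLOD` key in the INFO column of the recalibrated VCF.

```python
def site_key(fields):
    """Identify a site by (CHROM, POS, REF, ALT)."""
    return (fields[0], fields[1], fields[3], fields[4])


def vqslod(info):
    """Return the VQSLOD value from an INFO string, or None if absent."""
    for entry in info.split(";"):
        if entry.startswith("VQSLOD="):
            return float(entry.split("=", 1)[1])
    return None


def split_vqslod_by_truth(recal_lines, truth_lines):
    """Return (scores at truth-set sites, scores at all other sites)."""
    truth = {
        site_key(line.rstrip("\n").split("\t"))
        for line in truth_lines
        if line.strip() and not line.startswith("#")
    }
    in_truth, other = [], []
    for line in recal_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        score = vqslod(fields[7])
        if score is None:
            continue  # record was not annotated with a VQSLOD
        (in_truth if site_key(fields) in truth else other).append(score)
    return in_truth, other
```

If the high-VQSLOD spike sits almost entirely in the first list, that would be consistent with the spike being the truth set scoring itself rather than independent evidence of quality.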