Interpreting results of VQSR on a non-human species
Hello and happy New Year!
I have run VQSR on a non-human data set: WGS (~20x) of 60 individually sequenced mice. Total number of raw SNPs (called according to GATK best practices):
As a truth/training resource, I used the sites that PASSed generic hard filters and that are found in either of two mouse genotyping arrays: the GIGA-Mouse Universal Genotyping Array or the Mouse Diversity Array. Total number of SNPs in resource:
As a training-only resource, I used variants reported by Sanger's Mouse Genomes Project that were found across 36 strains and PASSed their filters. Of these, 9,130,946 sites were found in my raw SNPs.
Known sites correspond to dbSNP150.
This is a snippet of the command line:
--resource TRUTH,known=false,training=true,truth=true,prior=12.0:RawSNPs_in_GIGA_or_MDA_OnlyPASS.vcf \
--resource sanger,known=false,training=true,truth=false,prior=10.0:mgp.v5.merged.snps_all.dbSNP142_PASS_final.vcf \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:mus_musculus.vcf \
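For context on the `prior=` values: as I understand it, GATK treats them as Phred-scaled probabilities that a site in the resource is a true variant. A minimal Python sketch of that conversion follows; the helper name `phred_to_prob` is my own and not part of GATK.

```python
def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled prior Q into the probability of being correct."""
    return 1.0 - 10.0 ** (-q / 10.0)

# Priors copied from the three --resource arguments in my command line
priors = {"TRUTH": 12.0, "sanger": 10.0, "dbsnp": 2.0}
prior_probs = {name: phred_to_prob(q) for name, q in priors.items()}

for name, p in prior_probs.items():
    print(f"{name}: Q{priors[name]:g} -> {p:.4f}")
```

So the array-based truth set is trusted at roughly 94%, the Sanger set at 90%, and dbSNP at about 37%.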
After VQSR (90% truth sensitivity), there are 8,070,948 SNPs that PASSed (~50% of the raw SNPs), of which 8,060,873 are bi-allelic.
The tranche plot shows a Ti/Tv ratio of [...] 1/5 of false positives at tranche 90. Also, Ti/Tv has a wide range across tranches (1.7 to 1.078). Overall, I think the tranche plot is telling me there is room for improvement.
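For reference, my understanding is that the tranche plot does not count false positives directly but estimates them from the novel Ti/Tv: true variants are assumed to have the target Ti/Tv, and random false positives a Ti/Tv of about 0.5. Below is a rough Python sketch of that estimate. The function name, the 0.5 default, and the target value of 2.0 are illustrative assumptions of mine, not values confirmed from my run or from the GATK source.

```python
def estimated_fp_fraction(titv_observed: float, titv_target: float,
                          titv_random: float = 0.5) -> float:
    """Estimate the false-positive fraction implied by an observed Ti/Tv.

    Assumes true calls sit at titv_target and false calls at titv_random,
    then interpolates linearly between the two and clamps to [0, 1].
    """
    if titv_target <= titv_random:
        raise ValueError("titv_target must be greater than titv_random")
    frac = 1.0 - (titv_observed - titv_random) / (titv_target - titv_random)
    return min(max(frac, 0.0), 1.0)

# Hypothetical target of 2.0; the two observed values are the ends of my range
for observed in (1.7, 1.078):
    fp = estimated_fp_fraction(observed, titv_target=2.0)
    print(f"novel Ti/Tv {observed}: estimated FP fraction {fp:.3f}")
```

Under these assumptions the estimate depends heavily on the chosen target Ti/Tv, which may not be appropriate for mouse.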
However, the tranche plot refers to novel variants (not found in any of the resources, including dbSNP). The Ti/Tv ratio for variants found in dbSNP, as reported by Picard's CollectVariantCallingMetrics, is 2.15, which is much more satisfactory, and these variants represent >95% of all bi-allelic SNPs.
| TOTAL_SNPS | NUM_IN_DB_SNP | NOVEL_SNPS | PCT_DBSNP | DBSNP_TITV | NOVEL_TITV |
|-----------:|--------------:|-----------:|----------:|-----------:|-----------:|
|    8060873 |       7674154 |     386719 |  0.952025 |   2.146346 |   1.700175 |
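As a sanity check on the table, and to see how much the novel SNPs pull down the overall Ti/Tv, here is a small Python sketch. It back-calculates approximate transition and transversion counts from each category's SNP count N and ratio r (Ti = N·r/(1+r), Tv = N/(1+r)), which assumes every bi-allelic SNP is either a transition or a transversion. The function and variable names are mine, and the counts are approximate because the reported ratios are rounded.

```python
def split_ti_tv(n_snps: int, titv: float) -> tuple[float, float]:
    """Return approximate (transitions, transversions) for n_snps at ratio titv."""
    tv = n_snps / (1.0 + titv)
    return n_snps - tv, tv

# Values copied from the CollectVariantCallingMetrics table
total_snps = 8_060_873
in_dbsnp = 7_674_154
novel = 386_719
dbsnp_titv = 2.146346
novel_titv = 1.700175

# The two categories should add up to the total
assert in_dbsnp + novel == total_snps
pct_dbsnp = in_dbsnp / total_snps

# Pool transitions and transversions across both categories
known_ti, known_tv = split_ti_tv(in_dbsnp, dbsnp_titv)
novel_ti, novel_tv = split_ti_tv(novel, novel_titv)
overall_titv = (known_ti + novel_ti) / (known_tv + novel_tv)

print(f"PCT_DBSNP     = {pct_dbsnp:.6f}")
print(f"overall Ti/Tv = {overall_titv:.3f}")
```

Because novel SNPs are under 5% of the total, the pooled Ti/Tv stays close to the dbSNP value.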
I would like to clarify the following:
1) Was the construction of the truth set reasonable (i.e. does it contain a sufficient number of sites)?
2) If novel variants are more likely to be false positives, how are false and true positives defined in the tranche plot, which is constructed from novel variants only?
3) Should I deal with the low Ti/Tv ratio at tranche 90, considering that novel SNPs correspond to <5% of all PASSing SNPs after VQSR?
Your feedback will be greatly appreciated!