Creating and using VQSR training sets for a non-human genome
I am working on the bovine genome (50 samples, 12X coverage) and am about to finish the HaplotypeCaller step. I would now like to start recalibrating my variants. As no resources comparable to those provided for human are available, I am considering several options and would appreciate some guidance.
One or several training datasets?
In the documentation, several training datasets are used. I have basically none! As suggested in the documentation, I thought about extracting a subset of trusted sites from my "raw" VCF files. My first intention was to build one list from SNPs for which all the genotypes obtained with HaplotypeCaller match the genotypes from a SNP chip. The drawback is that only SNPs are listed; would it be better to also include other types of variants?
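To make the chip-concordance idea concrete, here is a minimal sketch of the filter I have in mind. The in-memory layout (dictionaries keyed by chromosome and position), the sample names and the function names are all hypothetical, made up for illustration; real data would of course be read from the VCF and the chip export.

```python
# Hypothetical in-memory layout: {(chrom, pos): {sample: genotype}},
# where a genotype is a tuple of allele strings, e.g. ("A", "G").

def normalize(genotype):
    """Sort alleles so that A/G and G/A compare equal."""
    return tuple(sorted(genotype))


def trusted_sites(hc_calls, chip_calls, ref_alleles):
    """Return sites where every sample typed on both platforms agrees
    and at least one of those samples carries a non-reference allele."""
    trusted = []
    for site, chip_genotypes in chip_calls.items():
        hc_genotypes = hc_calls.get(site)
        if hc_genotypes is None:
            continue  # site not called by HaplotypeCaller
        shared = set(chip_genotypes) & set(hc_genotypes)
        if not shared:
            continue  # no sample typed on both platforms
        if any(normalize(hc_genotypes[s]) != normalize(chip_genotypes[s])
               for s in shared):
            continue  # at least one discordant genotype
        ref = ref_alleles[site]
        if any(allele != ref for s in shared for allele in chip_genotypes[s]):
            trusted.append(site)
    return sorted(trusted)


# Toy data, invented for the example.
hc = {
    ("1", 1000): {"cow1": ("A", "G"), "cow2": ("A", "A")},
    ("1", 2000): {"cow1": ("C", "T"), "cow2": ("C", "C")},
    ("1", 3000): {"cow1": ("G", "G"), "cow2": ("G", "G")},
}
chip = {
    ("1", 1000): {"cow1": ("G", "A"), "cow2": ("A", "A")},  # concordant, variable
    ("1", 2000): {"cow1": ("C", "C"), "cow2": ("C", "C")},  # discordant for cow1
    ("1", 3000): {"cow1": ("G", "G"), "cow2": ("G", "G")},  # concordant, monomorphic
}
ref = {("1", 1000): "A", ("1", 2000): "C", ("1", 3000): "G"}

print(trusted_sites(hc, chip, ref))
```

With this filter, only the first toy site would be kept: the second is discordant for one animal and the third shows no variation in the chip-typed samples.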
My alternatives so far are:
- Use a dbSNP resource and extract the sites listed in dbSNP that are also found in my data.
- Retain a list of variants also called with samtools and/or UnifiedGenotyper.
- Use a list of SNPs available on a commercial chip or present in dbSNP (and annotate them).
So I guess I could easily obtain several training datasets, but will it really help? What should we look for when building training set(s)?
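The first two alternatives both come down to intersecting site lists, which could be sketched roughly as follows. This assumes plain-text, tab-separated VCF records; the function name and the toy records are invented for illustration. I am also aware that indels may be represented differently by different callers, so a simple key match like this one would probably only be reliable for SNPs unless the files are normalized first.

```python
def read_sites(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines,
    one key per ALT allele, skipping header and empty lines."""
    sites = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            if allele != ".":
                sites.add((chrom, int(pos), ref, allele))
    return sites


# Toy records, invented for the example.
haplotypecaller = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\t.\t.",
    "1\t2000\t.\tC\tT,G\t60\t.\t.",
    "1\t3000\t.\tG\tGA\t40\t.\t.",
]
second_caller = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t45\t.\t.",
    "1\t2000\t.\tC\tT\t30\t.\t.",
    "1\t4000\t.\tT\tC\t20\t.\t.",
]

# Sites (and alleles) reported by both callers.
shared = read_sites(haplotypecaller) & read_sites(second_caller)
print(sorted(shared))
```

The same intersection would apply to a dbSNP file in place of the second caller's output.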
What is the meaning of the options passed with the training datasets?
Although they are apparently self-explanatory (known=, training=, truth=, prior=), I have some doubts about the meaning of the options passed with the training datasets. For example, "known=true" is applied to both the dbSNP and Mills datasets, which according to the documentation are the worst and the best datasets! So "known" may not mean "known to vary in the sample". Could the documentation be more explicit about the real meaning of each option and its possible effects on the results? Furthermore, concerning the prior, what range of values can it take? For instance, I have sites at which I know I should find a variant (since at least one individual is heterozygous according to the SNP chip): what value should I give to the prior? Conversely, if I add a dbSNP VCF file, some of its variants may not segregate within my population/breed/sample: how can I make GATK aware that I may be unable to find variation at these sites?
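My tentative reading of the prior, which I would like to have confirmed, is that it is a Phred-scaled probability that a site in the resource is a true variant. Under that assumption (mine, not something I found stated explicitly), the conversion would be the usual Phred formula, sketched here with a hypothetical helper function:

```python
def phred_to_probability(q):
    """Probability that a site is a true variant, assuming the
    prior q is Phred-scaled (an assumption, see the text above)."""
    return 1.0 - 10.0 ** (-q / 10.0)


# A few illustrative prior values; larger values mean more trust in the resource.
for q in (2.0, 10.0, 12.0, 15.0):
    print(f"prior={q:>4}: {phred_to_probability(q):.4f}")
```

If this reading is right, any non-negative value would be allowed in principle, and the question becomes how much confidence a chip-validated site deserves compared with a plain dbSNP entry.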
I hope I didn't miss a document somewhere answering these points.
Thank you in advance for your reply.