
Creating and using VQSR training sets for a non-human genome

FrancoisGuillaume, Jouy en Josas, Member

Dear team,

I am working on the bovine genome (50 samples, 12X coverage) and am about to finish the HaplotypeCaller step. I would now like to start recalibrating my variants. As no resources like those provided for human are available, I am considering several options, for which I would appreciate some guidance.

One or several training datasets?

The documentation uses several training datasets. I have basically none! As suggested in the documentation, I thought about extracting a subset of trusted sites from my "raw" VCF files. My first intention was to build a list of SNPs for which all the genotypes obtained with HaplotypeCaller are the same as the genotypes from a SNP chip. The drawback is that only SNPs are listed; would it be better to also include other types of variants?
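To make the "strict agreement" idea concrete, here is a minimal sketch in Python. It assumes the HaplotypeCaller and chip genotypes have already been loaded into plain dictionaries keyed by (chromosome, position); the names `concordant_sites` and `normalise` are hypothetical helpers for illustration, not part of GATK. Missing genotypes (containing ".") are skipped rather than counted as agreement or disagreement, which is one possible choice among several.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]        # (chromosome, 1-based position)
Genotypes = Dict[str, str]    # sample name -> genotype string, e.g. "A/G"


def normalise(gt: str) -> Tuple[str, ...]:
    """Return the alleles in sorted order, so that "G/A" equals "A/G"."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def concordant_sites(calls: Dict[Site, Genotypes],
                     chip: Dict[Site, Genotypes]) -> List[Site]:
    """Keep sites where every sample genotyped on both platforms agrees."""
    trusted = []
    for site, chip_gts in chip.items():
        call_gts = calls.get(site)
        if call_gts is None:
            continue  # site is on the chip but absent from the call set
        # Samples present in both sources with no missing allele.
        shared = [s for s in chip_gts
                  if s in call_gts
                  and "." not in chip_gts[s]
                  and "." not in call_gts[s]]
        if shared and all(normalise(call_gts[s]) == normalise(chip_gts[s])
                          for s in shared):
            trusted.append(site)
    return sorted(trusted)
```

The returned site list could then be used to subset the raw VCF into a candidate training file. Note that a single discordant sample removes the whole site, which is what "strict" means here.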

My alternatives so far are:

  • Use a dbSNP resource and extract the sites listed in dbSNP that are also found in my data.
  • Retain a list of variants also called with samtools and/or UnifiedGenotyper.
  • Use a list of SNPs available on a commercial chip or present in dbSNP (and annotate them).
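The first two alternatives both come down to intersecting two call sets. A rough sketch of that intersection in Python, assuming both inputs are uncompressed, tab-separated VCF text and that matching on (CHROM, POS, REF, ALT) is good enough; the function names `vcf_keys` and `sites_in_both` are hypothetical, and no variant normalisation (left-alignment, trimming) is attempted:

```python
from typing import Iterable, Iterator, List, Set, Tuple

Key = Tuple[str, int, str, str]   # (CHROM, POS, REF, ALT)


def vcf_keys(lines: Iterable[str]) -> Iterator[Tuple[Key, str]]:
    """Yield (key, line) for each VCF data line, one key per ALT allele."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            yield (chrom, int(pos), ref, allele), line.rstrip("\n")


def sites_in_both(calls: Iterable[str], known: Iterable[str]) -> List[str]:
    """Return call-set lines that share at least one allele with `known`."""
    known_keys: Set[Key] = {key for key, _ in vcf_keys(known)}
    kept: List[str] = []
    seen: Set[str] = set()
    for key, line in vcf_keys(calls):
        if key in known_keys and line not in seen:
            seen.add(line)   # a multi-allelic line is kept only once
            kept.append(line)
    return kept
```

The same function works whether `known` is a dbSNP file or a second call set from another caller; only the interpretation of the resulting list changes.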

So I guess I could easily obtain several training datasets, but will it really help? What should we look for when building training set(s)?

What is the meaning of the options passed with the training datasets?

Although they look self-explanatory (known=, training=, truth=, prior=), I have some doubts about the meaning of the options passed with the training datasets. For example, "known=true" is applied to both the dbSNP and Mills datasets, which according to the documentation are the worst and the best datasets respectively! So "known" may not mean "known to vary in the sample". Could the documentation be more explicit about the real meaning of each option and its possible effects on the results? Furthermore, concerning the prior, what range of values can it take? For instance, I have sites where I know I should find a variant (since at least one individual is heterozygous, based on the SNP chip); what kind of value should I give to the prior? On the other hand, if I add a dbSNP VCF file, some of the variants may not segregate within my population/breed/sample; how can I make GATK aware that I may be unable to find variation at these sites?
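On the range of the prior: as far as I understand the VQSR documentation, prior= is a Phred-scaled value expressing how likely a site in that resource is to be a true variant, so any non-negative number is possible and higher means more trusted. A small Python sketch of the conversion, under that assumption (the function names are hypothetical, not GATK options):

```python
import math


def prior_to_probability(prior_q: float) -> float:
    """Phred-scaled prior Q -> probability that a resource site is true."""
    return 1.0 - 10.0 ** (-prior_q / 10.0)


def probability_to_prior(p_true: float) -> float:
    """Probability that a resource site is true -> Phred-scaled prior Q."""
    if not 0.0 <= p_true < 1.0:
        raise ValueError("p_true must be in [0, 1)")
    return -10.0 * math.log10(1.0 - p_true)


if __name__ == "__main__":
    for q in (2.0, 10.0, 12.0, 15.0):
        print(f"prior={q:>4} -> P(true) = {prior_to_probability(q):.4f}")
```

Read this way, chip-validated sites would deserve a high prior, while an unfiltered dbSNP file would get a low one.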

I hope I didn't miss a document somewhere answering these points.
Thank you in advance for your reply


  • FrancoisGuillaume, Jouy en Josas, Member

    Since I wrote this question, the documentation has changed (thank you very much for the update effort).

    So, for those who have the same question, I guess the answer to the second question is here. The meaning of each option is really nicely explained, which is all I was looking for. Well done!

    Concerning the first question :
    I extracted a list of sites with strict agreement between the genotypes in my VCF and earlier data from a SNP chip. The resulting graphical report looks too good and does not really help to interpret what is going on. So far, my guess is that using several different files at least has the advantage of giving a more detailed graphical report of the recalibration.
