Geraldine_VdAuwera
Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.
In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.
If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.
If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum, "What are the standard resources for non-human genomes?", in which we hope people with non-human genomics experience will share their knowledge.
And if it turns out that there is as yet no suitable set of known sites for your organism, here's how to make your own for the purposes of base recalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and feed that set, as a VCF file, to the base quality score recalibrator as the database of known SNPs. Finally, do a real round of SNP calling with the recalibrated data. These steps can be repeated several times until convergence. Good luck!
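The "repeat until convergence" idea can be sketched in Python as follows. This is only an illustration of the control flow, not a GATK feature: `bootstrap_known_sites`, `call_confident_snps` and `toy_pipeline` are hypothetical names, and the hook is assumed to wrap whatever recalibration, calling and filtering commands you actually run.

```python
from typing import Callable, Set, Tuple

# A variant site keyed by (chromosome, position, ref allele, alt allele).
Site = Tuple[str, int, str, str]


def bootstrap_known_sites(
    call_confident_snps: Callable[[Set[Site]], Set[Site]],
    max_rounds: int = 10,
) -> Tuple[Set[Site], int]:
    """Repeat recalibrate-and-call until the confident SNP set stops changing.

    call_confident_snps(known) is a hypothetical hook. It is assumed to
    recalibrate the data using `known` as the known-sites set (an empty set
    means "call on the unrecalibrated data"), call SNPs, and return only the
    highest-confidence calls.
    """
    known: Set[Site] = set()
    for round_number in range(1, max_rounds + 1):
        new_known = call_confident_snps(known)
        if new_known == known:  # nothing changed: treat as converged
            return known, round_number
        known = new_known
    raise RuntimeError(f"no convergence after {max_rounds} rounds")


def toy_pipeline(known: Set[Site]) -> Set[Site]:
    """Stand-in for a real pipeline: recovers one more site per round."""
    truth = sorted({("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")})
    return set(truth[: min(len(known) + 1, len(truth))])


sites, rounds = bootstrap_known_sites(toy_pipeline)
print(len(sites), "sites after", rounds, "rounds")
```

Comparing exact site sets is the strictest possible stopping rule; in practice you might prefer to stop once the set changes by less than some small fraction between rounds.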
Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.
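Both suggestions (intersecting several call sets, or filtering very strictly) can be prototyped with a few lines of Python. The sketch below is not part of GATK; the function names and the thresholds are made up for illustration, and it reads only the fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER).

```python
from typing import Iterable, List, Set, Tuple

# A variant site keyed by (chromosome, position, ref allele, alt allele).
Site = Tuple[str, int, str, str]


def sites_from_vcf_lines(
    lines: Iterable[str], min_qual: float = 0.0, require_pass: bool = False
) -> Set[Site]:
    """Collect sites from VCF text lines, optionally applying a strict filter."""
    sites: Set[Site] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt, qual, flt = line.rstrip("\n").split("\t")[:7]
        if require_pass and flt != "PASS":
            continue
        if min_qual > 0.0 and (qual == "." or float(qual) < min_qual):
            continue
        # Multi-allelic ALT values (e.g. "G,T") are kept as one string here.
        sites.add((chrom, int(pos), ref, alt))
    return sites


def high_confidence_sites(call_sets: List[Set[Site]]) -> Set[Site]:
    """Keep only the sites reported by every caller."""
    return set.intersection(*call_sets) if call_sets else set()


caller_a = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50.0\tPASS\t.",
    "1\t200\t.\tC\tT\t12.0\tPASS\t.",
    "2\t50\t.\tG\tA\t99.0\tLowQual\t.",
]
caller_b = [
    "1\t100\t.\tA\tG\t80\t.\t.",
    "2\t50\t.\tG\tA\t70\t.\t.",
]

strict_a = sites_from_vcf_lines(caller_a, min_qual=30.0, require_pass=True)
strict_b = sites_from_vcf_lines(caller_b, min_qual=30.0)
consensus = high_confidence_sites([strict_a, strict_b])
print(sorted(consensus))
```

Which thresholds count as "very strict" is exactly the part that needs experimentation; the values above are placeholders.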
| Tool | dbSNP 129 | dbSNP >132 | Mills indels | 1KG indels | HapMap | Omni |
|---|---|---|---|---|---|---|
| RealignerTargetCreator | | | X | X | | |
| IndelRealigner | | | X | X | | |
| BaseRecalibrator | | X | X | X | | |
| (UnifiedGenotyper/HaplotypeCaller) | | X | | | | |
| VariantRecalibrator | | X | X | | X | X |
| VariantEval | X | | | | | |
RealignerTargetCreator and IndelRealigner require known indels passed with the -known argument to function properly. We use both the following files:
BaseRecalibrator requires known SNPs and indels passed with the -knownSites argument to function properly. We use all the following files:
UnifiedGenotyper and HaplotypeCaller do NOT require known sites, but if SNPs are provided with the -dbsnp argument they will use them for variant annotation. We use this file:
VariantRecalibrator requires known SNPs and indels passed with the -resource argument to function properly. We use all the following files:
For best results, these resources should be passed with these parameters:
-resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
-resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf
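To make the structure of these tags easier to see, here is a small Python sketch that splits one -resource tag into its fields. The helper names are hypothetical. The conversion of the prior to a probability assumes the prior is a Phred-scaled value (the comments below refer to these priors as Q6 and Q8), so treat that part as an interpretation rather than a statement of how VariantRecalibrator uses the number internally.

```python
from typing import Dict, Union

FieldValue = Union[str, bool, float]


def parse_resource_tag(argument: str) -> Dict[str, FieldValue]:
    """Split '-resource:name,VCF,key=value,...' into a dictionary of fields."""
    flag, _, tag = argument.partition(":")
    if flag != "-resource" or not tag:
        raise ValueError(f"not a -resource tag: {argument!r}")
    parts = tag.split(",")
    parsed: Dict[str, FieldValue] = {"name": parts[0]}
    for part in parts[1:]:
        if "=" not in part:
            parsed["format"] = part  # bare token such as 'VCF'
            continue
        key, value = part.split("=", 1)
        if value in ("true", "false"):
            parsed[key] = value == "true"
        elif key == "prior":
            parsed[key] = float(value)
        else:
            parsed[key] = value
    return parsed


def prior_to_probability(prior_q: float) -> float:
    """Phred-scaled prior Q -> probability, assuming p = 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-prior_q / 10.0)


hapmap = parse_resource_tag(
    "-resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0"
)
print(hapmap["name"], hapmap["truth"], round(prior_to_probability(hapmap["prior"]), 4))
```

Under that assumption, a higher prior expresses more trust that the sites in the resource are real variants, which matches HapMap getting the highest value in the command above and dbSNP the lowest.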
VariantEval requires known SNPs passed with the -dbsnp argument to function properly. We use the following file:
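If you script your pipeline, the per-tool argument names above can be kept in one lookup table. The Python sketch below is only a convenience wrapper: the function name and the file names in the usage lines are hypothetical, the tool-to-flag pairing follows the table and notes in this post, and it assumes that -known and -knownSites can be repeated once per file while -dbsnp takes a single file.

```python
from typing import Dict, List, Sequence

# Argument name each tool uses for its known sites, as described in this post.
KNOWN_SITES_FLAG: Dict[str, str] = {
    "RealignerTargetCreator": "-known",
    "IndelRealigner": "-known",
    "BaseRecalibrator": "-knownSites",
    "UnifiedGenotyper": "-dbsnp",
    "HaplotypeCaller": "-dbsnp",
    "VariantEval": "-dbsnp",
}

# Tools for which known sites are optional (used for annotation only).
OPTIONAL_TOOLS = {"UnifiedGenotyper", "HaplotypeCaller"}


def known_sites_args(tool: str, vcf_paths: Sequence[str]) -> List[str]:
    """Build the known-sites portion of a command line for one tool."""
    if tool == "VariantRecalibrator":
        # Its -resource arguments carry tags (known, training, truth, prior).
        raise ValueError("build tagged -resource arguments separately")
    flag = KNOWN_SITES_FLAG[tool]
    if not vcf_paths and tool not in OPTIONAL_TOOLS:
        raise ValueError(f"{tool} needs at least one known-sites file")
    if flag == "-dbsnp" and len(vcf_paths) > 1:
        raise ValueError("-dbsnp is assumed to take a single file")
    args: List[str] = []
    for path in vcf_paths:
        args.extend([flag, path])
    return args


# Hypothetical file names, for illustration only.
print(known_sites_args("BaseRecalibrator", ["my_snps.vcf", "my_indels.vcf"]))
print(known_sites_args("HaplotypeCaller", []))
```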
Geraldine Van der Auwera, PhD
Comments
When you do this "looping round until convergence" you use the VCF file from the previous loop as the -knownSites for the current loop. Do you do the same for the BAM file, or use the same one all the time?
Technically, the official answer is that we don't know, because this is very experimental. I think the best thing would be to use the original BAM file each time. Cheers,
I tried both the original BAM file (realigned) and the new BAM file generated in each iteration, and the result did not change. After 4 loops the VCF file converged.
Good to know, thanks for sharing your observations, @Iberna.
Geraldine Van der Auwera, PhD
The Mills_and_1000G_gold_standard.indels.b37.sites.vcf file is not present in the bundle of the current version. Does that mean we should use Mills_and_1000G_gold_standard.indels.b37.vcf instead of Mills_and_1000G_gold_standard.indels.b37.sites.vcf? And why are there files with the .out suffix, like Mills_and_1000G_gold_standard.indels.b37.vcf.out? Congrats on the good work!
Mills_and_1000G_gold_standard.indels.b37.vcf is a sites-only file... We were previously storing 2 copies of the same file, and now we only store one. You can ignore the .out files (I'll remove them).
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
There is conflicting information on the suggested value of the prior for the dbSNP/known track used in VQSR. On this page (and a few others) the value is Q6. On the page linked below (which is very well written, by the way!), the suggested value is Q8. http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration
Well, I wouldn't say it is conflicting information, but I appreciate that it is confusing for the command lines to be different. The most up-to-date Best Practices recommendations are found on this page, while the other page you reference is a VQSR tutorial with fixed input data and output results.
I hope that helps, Cheers,
Fair enough. In the end, all that matters (at least to me) is what the most up-to-date suggested parameter settings are. If this is where to go for them (FAQs), then that works for me ;) Thanks!
Hi Iberna. Can you give me some details of the loop of programs you used for this?
@Geraldine_VdAuwera I found something contradictory regarding VariantRecalibrator for indels.
In this article, the resource for known indels is recommended as
"-resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf ";
However, in another post, it is
"-resource:mills,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \".
So should "known" be set to "true" or "false"?
The value of "known" does not affect the filtering process; its only effect is that any indels in your samples that overlap with this set will be annotated as "known" rather than "novel". Sometimes we want to use a different set of indels as the "known" database, but generally the Mills resource is set to known=true.
Geraldine Van der Auwera, PhD
How about ESP variants? Can I use those as resources? If so, how should I set the parameters for VariantRecalibrator?
Hi @feng_b,
It is up to you to decide what resources are appropriate for your data. We can only advise on what we have found to work well with our human data. We encourage you to experiment with other types of resources that you may have at your disposal. If you find something interesting, please share your findings with the community, either by posting them in the forum or linking to your paper when you publish.
Geraldine Van der Auwera, PhD
For the BAM file, what do you mean by the 'new bam generated in each iteration'? I have one original BAM file, which I use for SNP calling; that gives me the VCF files, but the BAM file itself doesn't change. Thanks in advance for your help.
@Homa, I believe @lberna was talking about the files generated by iterative steps of recalibration to produce an adequate set of known sites for base recalibration. This is a very specific protocol that is not typically necessary if you already have a set of known sites available (as for humans).
Geraldine Van der Auwera, PhD
Am I correct in assuming the current (last edited March 25) VQSR settings in this document are superseded by http://gatkforums.broadinstitute.org/discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project (last edited May 31)?
Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute
Yes, that's correct.
Geraldine Van der Auwera, PhD