# What should I use as known variants/sites for running tool X?

### 1. Notes on known sites

#### Why are they important?

Each tool uses known sites differently, but all of them use known sites to help distinguish true variants from false positives, which is central to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.

In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.

#### Human genomes

If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.

#### Non-human genomes

If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum on "What are the standard resources for non-human genomes?" in which we hope people with non-human genomics experience will share their knowledge.

And if it turns out that there is as yet no suitable set of known sites for your organism, here's how to make your own for the purposes of base recalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps can be repeated several times until convergence. Good luck!
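As a rough illustration, the bootstrap rounds described above could be laid out as a dry run that only prints the commands for each round. The file names (`ref.fasta`, `original.bam`, `roundN.*`) are placeholders, the step that selects high-confidence SNPs is left as a comment because the method is up to you, and the original BAM is fed to the recalibrator in every round, which is an assumption rather than an official recommendation.

```shell
# Dry run: prints the commands for each bootstrap round instead of running them.
# ref.fasta, original.bam and the round*.* names are placeholders.
GATK="java -jar GenomeAnalysisTK.jar"
REF=ref.fasta
BAM=original.bam

print_round() {
    round=$1
    # Round 1 calls SNPs on the unrecalibrated BAM; later rounds call on the
    # BAM that was recalibrated in the previous round.
    if [ "$round" -eq 1 ]; then
        calls_in=$BAM
    else
        calls_in=round$((round - 1)).recal.bam
    fi
    echo "$GATK -T UnifiedGenotyper -R $REF -I $calls_in -o round$round.raw.vcf"
    echo "# select the highest-confidence SNPs from round$round.raw.vcf into round$round.known.vcf"
    echo "$GATK -T BaseRecalibrator -R $REF -I $BAM -knownSites round$round.known.vcf -o round$round.recal.table"
    echo "$GATK -T PrintReads -R $REF -I $BAM -BQSR round$round.recal.table -o round$round.recal.bam"
}

for i in 1 2 3; do
    print_round "$i"
done
```

Convergence could then be judged by comparing the known-sites VCFs of successive rounds and stopping when they no longer change.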

Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.
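One way to try the set-intersection idea is to compare sites by chromosome, position, REF and ALT. The sketch below does that with standard Unix tools on two tiny made-up VCFs (stand-ins for the output of two different callers) so that it runs on its own. The file names and the choice of matching key are assumptions for illustration only.

```shell
# Sketch: keep only the sites that two callers agree on.
# The two small VCFs are made-up stand-ins so the example is self-contained.
tmp=$(mktemp -d)
vcf_header='##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
printf "${vcf_header}1\t100\t.\tA\tG\t50\t.\t.\n1\t200\t.\tC\tT\t60\t.\t.\n1\t300\t.\tG\tA\t70\t.\t.\n" > "$tmp/callerA.vcf"
printf "${vcf_header}1\t200\t.\tC\tT\t55\t.\t.\n1\t300\t.\tG\tC\t65\t.\t.\n1\t400\t.\tT\tC\t75\t.\t.\n" > "$tmp/callerB.vcf"

# Reduce a VCF to sorted, unique CHROM/POS/REF/ALT keys.
site_keys() {
    grep -v '^#' "$1" | cut -f1,2,4,5 | LC_ALL=C sort -u
}

site_keys "$tmp/callerA.vcf" > "$tmp/a.keys"
site_keys "$tmp/callerB.vcf" > "$tmp/b.keys"

# comm -12 prints only the lines present in both sorted inputs.
LC_ALL=C comm -12 "$tmp/a.keys" "$tmp/b.keys" > "$tmp/shared.keys"

# Keep the header plus the caller A records whose key is shared.
awk -F'\t' -v keys="$tmp/shared.keys" '
    BEGIN { while ((getline line < keys) > 0) shared[line] = 1 }
    /^#/ || (($1 FS $2 FS $4 FS $5) in shared)
' "$tmp/callerA.vcf" > "$tmp/bootstrap_known.vcf"

grep -v '^#' "$tmp/bootstrap_known.vcf"
```

Here only the site at position 200 survives: position 300 is reported by both callers but with different ALT alleles, so it is dropped under this matching key.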

### 2. Recommended sets of known sites per tool

#### Summary table

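| Tool | dbSNP 129 | dbSNP >132 | Mills indels | 1KG indels | HapMap | Omni |
|---|---|---|---|---|---|---|
| RealignerTargetCreator | | | X | X | | |
| IndelRealigner | | | X | X | | |
| BaseRecalibrator | | X | X | X | | |
| (UnifiedGenotyper / HaplotypeCaller) | | X | | | | |
| VariantRecalibrator | | X | X | | X | X |
| VariantEval | X | | | | | |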

#### RealignerTargetCreator and IndelRealigner

These tools require known indels passed with the -known argument to function properly. We use both the following files:

• Mills_and_1000G_gold_standard.indels.b37.sites.vcf
• 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
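As a minimal sketch, both files can be supplied by repeating `-known` once per file. The example below only prints the two command lines (a dry run). `ref.fasta`, `input.bam`, `realigner.intervals` and `realigned.bam` are placeholder names, not files from the bundle.

```shell
# Dry run: prints the realignment commands instead of running them.
# ref.fasta, input.bam and the output names are placeholders.
GATK="java -jar GenomeAnalysisTK.jar"
KNOWN="-known Mills_and_1000G_gold_standard.indels.b37.sites.vcf -known 1000G_phase1.indels.b37.vcf"

# Step 1: find the intervals that need realignment.
TARGET_CMD="$GATK -T RealignerTargetCreator -R ref.fasta -I input.bam $KNOWN -o realigner.intervals"
# Step 2: realign reads over those intervals, using the same known indels.
REALIGN_CMD="$GATK -T IndelRealigner -R ref.fasta -I input.bam $KNOWN -targetIntervals realigner.intervals -o realigned.bam"

echo "$TARGET_CMD"
echo "$REALIGN_CMD"
```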

#### BaseRecalibrator

This tool requires known SNPs and indels passed with the -knownSites argument to function properly. We use all the following files:

• The most recent dbSNP release (build ID > 132)
• Mills_and_1000G_gold_standard.indels.b37.sites.vcf
• 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
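A sketch of how the three resources might be passed together, repeating `-knownSites` once per file. This is a dry run that only prints the command. `dbsnp.b37.vcf` is a placeholder name for whichever recent dbSNP release you use, and `ref.fasta`, `realigned.bam` and `recal.table` are likewise placeholders.

```shell
# Dry run: builds and prints a BaseRecalibrator command with several known-sites files.
GATK="java -jar GenomeAnalysisTK.jar"
DBSNP=dbsnp.b37.vcf  # placeholder name for a dbSNP release newer than build 132

BQSR_CMD="$GATK -T BaseRecalibrator -R ref.fasta -I realigned.bam"
# Append one -knownSites argument per resource file.
for vcf in "$DBSNP" \
           Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
           1000G_phase1.indels.b37.vcf; do
    BQSR_CMD="$BQSR_CMD -knownSites $vcf"
done
BQSR_CMD="$BQSR_CMD -o recal.table"

echo "$BQSR_CMD"
```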

#### UnifiedGenotyper / HaplotypeCaller

These tools do NOT require known sites, but if SNPs are provided with the -dbsnp argument they will use them for variant annotation. We use this file:

• The most recent dbSNP release (build ID > 132)
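Because the dbSNP file is optional here, a wrapper can simply leave the argument out when no file is given. The sketch below prints a HaplotypeCaller command with and without it, using the long spelling `--dbsnp` of the argument. `call_cmd` is a hypothetical helper, and `dbsnp.b37.vcf`, `ref.fasta`, `recal.bam` and `raw.vcf` are placeholder names.

```shell
# Dry run: prints a HaplotypeCaller command, with --dbsnp only when a file is given.
GATK="java -jar GenomeAnalysisTK.jar"

call_cmd() {
    # $1: optional dbSNP VCF; when it is empty, the --dbsnp argument is omitted.
    echo "$GATK -T HaplotypeCaller -R ref.fasta -I recal.bam ${1:+--dbsnp $1 }-o raw.vcf"
}

call_cmd ""             # calling without annotation from dbSNP
call_cmd dbsnp.b37.vcf  # same call, with known SNPs supplied for annotation
```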

#### VariantRecalibrator

For VariantRecalibrator, please see the FAQ article on VQSR training sets and arguments.

#### VariantEval

This tool requires known SNPs passed with the -dbsnp argument to function properly. We use the following file:

• A version of dbSNP subsetted to only sites discovered in or before dbSNP BuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.

Geraldine Van der Auwera, PhD

When you do this "looping round until convergence" you use the VCF file from the previous loop as the -knownSites for the current loop. Do you do the same for the BAM file, or use the same one all the time?

Technically the official answer is we don't know because it is very experimental. I think the best thing would be to use the original bam file each time for this. Cheers,

I try both the original bam file (realigned) and the new bam generated in each iteration, and the result does not change. After 4 loops the vcf file converged.

Good to know, thanks for sharing your observations, @Iberna.

Geraldine Van der Auwera, PhD

The Mills_and_1000G_gold_standard.indels.b37.sites.vcf file is not present in the bundle of the current version. Does that mean we should use Mills_and_1000G_gold_standard.indels.b37.vcf instead of Mills_and_1000G_gold_standard.indels.b37.sites.vcf? And why are there files with the suffix .out, like Mills_and_1000G_gold_standard.indels.b37.vcf.out? Congrats on the good work!

Mills_and_1000G_gold_standard.indels.b37.vcf is a sites-only file... We were previously storing 2 copies of the same file, and now we only store one. You can ignore the .out files (I'll remove them).

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

There is conflicting information on what the suggested value of the prior is for the dbSNP/known track to be used in VQSR. On this page (and a few others) the value is Q6. On the link below, (which is very well written by the way!), the suggested value is Q8. http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration

Well, I wouldn't say it is conflicting information but I appreciate that it is confusing for them to be different command lines. The most up-to-date best practice recommendations are found on this page while the other page you reference is a VQSR tutorial with fixed input data and output results.

I hope that helps, Cheers,

Fair enough. In the end, all that matters (at least to me) is what the most up-to-date suggested parameter settings are. If this is where you go for them (FAQs), then that works for me ;) Thanks!

Hi Iberna. Can you give me some details of the loop of programs you used for this?

@Geraldine_VdAuwera I found something contradictory regarding VariantRecalibrator for indels.
"-resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf ";
However, in another post, it is
"-resource:mills,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \".
So should "known" be set to "true" or "false"?

The value of "known" does not affect the filtering process; its only effect is that any indels in your samples that overlap with this set will be annotated as "known" rather than "novel". Sometimes we want to use a different set of indels as the "known" database. But generally the Mills resource is set to "known"="true".

Geraldine Van der Auwera, PhD

How about ESP variants? Can I use those as resources? If yes, how to set the parameters for VariantRecalibrator?

Hi @feng_b,

It is up to you to decide what resources are appropriate for your data. We can only provide guidance on what we have found to work well with our human data. We encourage you to experiment with other types of resources that you may have at your disposal. If you find something interesting, please share your findings with the community, either by posting them in the forum or linking to your paper when you publish.

Geraldine Van der Auwera, PhD

For the bam file, what do you mean by the 'new bam generated in each iteration'? I have one original bam file that I use for SNP calling, which gives me the vcf files, but the bam file itself doesn't change. Thanks in advance for your help.

@lberna said: I try both the original bam file (realigned) and the new bam generated in each iteration, and the result does not change. After 4 loops the vcf file converged.

@Homa, I believe @lberna was talking about the files generated by iterative steps of recalibration to produce an adequate set of known sites for base recalibration. This is a very specific protocol that is not typically necessary if you already have a set of known sites available (as for humans).

Geraldine Van der Auwera, PhD

Am I correct in assuming the current (last edited March 25 ) VQSR settings in this document are superseded by http://gatkforums.broadinstitute.org/discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project (last edited May 31)?

Martin Pollard, Human Genetics Informatics - Wellcome Trust Sanger Institute

Yes, that's correct.

Geraldine Van der Auwera, PhD

If I would like to use both "known" files, Mills_and_1000G_gold_standard.indels.hg19.vcf and 1000G_phase1.indels.hg19.vcf, what should the command look like?

java -jar GenomeAnalysisTK.jar -T IndelRealigner \
-known Mills_and_1000G_gold_standard.indels.hg19.vcf \
-known 1000G_phase1.indels.hg19.vcf \
-R ...
Should I use the "known" option twice?

Yes, you use the "known" option twice.

Geraldine Van der Auwera, PhD

The "Summary table" seems to be missing a column "1KG SNPs", which should be assigned to VariantRecalibrator, to reflect the fact that 1000G_phase1.snps.high_confidence.* is also part of the suggested inputs.

Fair point @george, I'll schedule an update of that doc. Thanks for pointing this out!

Geraldine Van der Auwera, PhD

Hi,

the Mills data set is used quite frequently and is a very good resource. But which publication should be cited as the reference? I seem to be unable to find this on your pages. "An initial map of insertion and deletion (INDEL) variation in the human genome" (Genome Res. 2006. 16: 1182-1190) looks like it, but it could also be the newer version "Natural genetic variation caused by small insertions and deletions in the human genome" (Genome Res. 2011. 21: 830-839).