
Recommended parameters for VariantRecalibrator in GATK4


I'm trying to figure out the recommended set of parameters for VariantRecalibrator on SNPs. The official joint discovery pipeline on Github [1] and the example in the documentation [2] differ in a few ways: the number of Gaussians (6 vs. 8), slightly different sets of training annotations, and different priors for dbSNP (7 vs. 2). The Github version also adds --trust-all-polymorphic and trains on only every 10th variant.

The dataset I'm working on: 17k samples at 15x WGS, but so far only called in exonic regions (i.e., it should look like WES, but with lower and more homogeneous depth). Variants were called with GATK4.

I understand that the parameters can be tweaked extensively, but I'd like to start with the most "standard" version first.


[1] The Github pipeline:

${gatk_path} --java-options "-Xmx100g -Xms100g" \
  VariantRecalibrator \
  -V ${sites_only_variant_filtered_vcf} \
  -O ${recalibration_filename} \
  --tranches-file ${tranches_filename} \
  --trust-all-polymorphic \
  -tranche ${sep=' -tranche ' recalibration_tranche_values} \
  -an ${sep=' -an ' recalibration_annotation_values} \
  -mode SNP \
  --sample-every-Nth-variant ${downsampleFactor} \
  --output-model ${model_report_filename} \
  --max-gaussians 6 \
  -resource hapmap,known=false,training=true,truth=true,prior=15:${hapmap_resource_vcf} \
  -resource omni,known=false,training=true,truth=true,prior=12:${omni_resource_vcf} \
  -resource 1000G,known=false,training=true,truth=false,prior=10:${one_thousand_genomes_resource_vcf} \
  -resource dbsnp,known=true,training=false,truth=false,prior=7:${dbsnp_resource_vcf}

and training features:

"JointGenotyping.indel_recalibration_annotation_values": ["FS", "ReadPosRankSum", "MQRankSum", "QD", "SOR", "DP"]

[2] Example in the docs:

gatk VariantRecalibrator \
  -R Homo_sapiens_assembly38.fasta \
  -V input.vcf.gz \
  --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz \
  --resource omni,known=false,training=true,truth=false,prior=12.0:1000G_omni2.5.hg38.sites.vcf.gz \
  --resource 1000G,known=false,training=true,truth=false,prior=10.0:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
  --resource dbsnp,known=true,training=false,truth=false,prior=2.0:Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
  -mode SNP \
  --recal-file output.recal \
  --tranches-file output.tranches \
  --rscript-file output.plots.R
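For concreteness, my tentative plan (purely a guess on my part, not something either source prescribes) would be to combine the Github pipeline's parameters (6 Gaussians, dbSNP prior of 7, --trust-all-polymorphic) with the SNP annotation list spelled out in the docs example. All file names below are placeholders from the standard hg38 bundle:

```shell
# Hypothetical merged starting point -- NOT an official recommendation.
# Resource/output paths are placeholders; adjust to your own files.
gatk --java-options "-Xmx100g" VariantRecalibrator \
  -R Homo_sapiens_assembly38.fasta \
  -V sites_only.vcf.gz \
  -O output.recal \
  --tranches-file output.tranches \
  --trust-all-polymorphic \
  -mode SNP \
  --max-gaussians 6 \
  -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
  --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz \
  --resource omni,known=false,training=true,truth=true,prior=12.0:1000G_omni2.5.hg38.sites.vcf.gz \
  --resource 1000G,known=false,training=true,truth=false,prior=10.0:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
  --resource dbsnp,known=true,training=false,truth=false,prior=7.0:Homo_sapiens_assembly38.dbsnp138.vcf.gz
```

(I've left out --sample-every-Nth-variant here, since our exonic-only callset is small enough to train on all sites.)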


  • Sheila (Broad Institute)


    I am not sure why those differences exist, but I will check with the team. In the meantime, it is probably best to stick with the official Github workflow parameters, as they are the most current; the example commands in the docs are older and may not have been updated recently.


  • Thanks @Sheila! I'll try rerunning with the Github pipeline. Is there any downside to training the model on all sites rather than subsetting to every 10th? We only have exonic regions, so it's not too computationally expensive.

  • Sheila (Broad Institute)


    Any downside in training the model on all sites, rather than subsetting to every 10th?

    No, I think that would be better.

