If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Recommended parameters for VariantRecalibrator in GATK4


I'm trying to figure out the recommended set of parameters for VariantRecalibrator on SNPs. The official joint discovery pipeline on Github [1] and the example documentation [2] seem to differ in a few ways: different number of gaussians (6 vs 8), slightly different training feature sets, and different priors for dbSNP (7 vs 2). There are other differences like --trust-all-polimorphic and training on every 10th variant in the Github version.

The dataset I'm working on: 17k samples, 15x WGS, but only called in exonic regions so far (i.e should look like WES, but with lower, yet more homogeneous depth). Variants are called with GATK4.

I understand that a lot of tweaking can be made to the parameters, but would like to start with the most "standard" version first.


[1] The Github pipeline:

${gatk_path} --java-options "-Xmx100g -Xms100g" \ VariantRecalibrator \ -V ${sites_only_variant_filtered_vcf} \ -O ${recalibration_filename} \ --tranches-file ${tranches_filename} \ --trust-all-polymorphic \ -tranche ${sep=' -tranche ' recalibration_tranche_values} \ -an ${sep=' -an ' recalibration_annotation_values} \ -mode SNP \ --sample-every-Nth-variant ${downsampleFactor} \ --output-model ${model_report_filename} \ --max-gaussians 6 \ -resource hapmap,known=false,training=true,truth=true,prior=15:${hapmap_resource_vcf} \ -resource omni,known=false,training=true,truth=true,prior=12:${omni_resource_vcf} \ -resource 1000G,known=false,training=true,truth=false,prior=10:${one_thousand_genomes_resource_vcf} \ -resource dbsnp,known=true,training=false,truth=false,prior=7:${dbsnp_resource_vcf}

and training features:

"JointGenotyping.indel_recalibration_annotation_values": ["FS", "ReadPosRankSum", "MQRankSum", "QD", "SOR", "DP"]

[2] Example in the docs:

gatk VariantRecalibrator \ -R Homo_sapiens_assembly38.fasta \ -V input.vcf.gz \ --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz \ --resource omni,known=false,training=true,truth=false,prior=12.0:1000G_omni2.5.hg38.sites.vcf.gz \ --resource 1000G,known=false,training=true,truth=false,prior=10.0:1000G_phase1.snps.high_confidence.hg38.vcf.gz \ --resource dbsnp,known=true,training=false,truth=false,prior=2.0:Homo_sapiens_assembly38.dbsnp138.vcf.gz \ -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \ -mode SNP \ --recal-file output.recal \ --tranches-file output.tranches \ --rscript-file output.plots.R


  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin


    I am not sure why those differences exist, but I will check with the team. In the meantime, it is probably best to stick with the official github workflow parameters as they are the most current. The docs have old example commands which may have been updated recently.


  • Thanks @Sheila! Will try rerunning with the Github pipeline. Any downside in training the model on all sites, rather than subsetting to every 10th? We only have exonic regions, so it's not too computationally expensive

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin


    Any downside in training the model on all sites, rather than subsetting to every 10th?

    No, I think that would be better.


Sign In or Register to comment.