Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

VQSR --maxGaussians paramater


I am performing VQSR (GATK 3.7) using the SNP model on individual chromosomes on hundreds of WGS data. However, for some chromosomes, it ran without a problem using the --maxGaussians default, which is 8. We followed your suggestion to lower that number to 4, it still doesn't work. We tried 3, still doesn't work. Finally it works when --maxGaussians is set to 2.

Apparently this --maxGaussians parameter has nothing to do with the chromosome size. My questions are
1) if --maxGaussians = 2 scientifically sound?
2) would you suggest using a uniform --maxGaussians parameter across all the chromosomes, i.e. 2 in this case, or I should just tune the parameter per chromosome?

Can you provide some guidance on this?



  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    If you read the docs on VQSR and watch the workshop talks you'll find that parameter refers to the max number of variant clusters the modeling process should attempt to identify.

    Your problem here is that you seem to be running VQSR per-chromosome, which is a bad idea for several reasons -- mainly, it underpowers the model by limiting the amount of data points available per run, and leads to different filtering rules per chromosome, which is not what you want.

    So you should simply run VariantRecalibrator on all the data at once.

  • yyeeyyee Member

    Thanks for the suggestion.

    I have tried running the VariantRecalibrator using all the chromosomes, however, I am still getting the same error.

    INFO 12:28:44,478 VariantRecalibratorEngine - Convergence after 80 iterations!
    INFO 12:28:45,877 VariantRecalibratorEngine - Evaluating full set of 847100 variants...
    INFO 12:28:45,921 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

    ERROR --
    ERROR stack trace

    java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:489)
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:185)
    at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:115)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
  • SheilaSheila Broad InstituteMember, Broadie admin


    Can you please post the exact command you ran? Are you running on the VCF containing hundred of WGS samples?


  • yyeeyyee Member

    Hi ,

    We are running on 1 VCF that contains hundreds of WGS samples:


    Program Args: -T VariantRecalibrator -R /mnt/data/NGS/ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /mnt/data/NGS/ref/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /mnt/data/NGS/ref/hg38/hapmap_3.3.hg38.vcf.gz -resource:omni,known=false,training=true,truth=true,prior=12.0 /mnt/data/NGS/ref/hg38/1000G_omni2.5.hg38.vcf.gz -resource:1000G,known=false,training=true,truth=false,prior=10.0 /mnt/data/NGS/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz -an QD -an FS -an DP -an SOR -an MQ -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff -mode SNP -allPoly -L chr20 -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.7 -tranche 99.5 -tranche 99.3 -tranche 99.0 -tranche 98.5 -tranche 98.0 -tranche 97.0 -tranche 95.0 -tranche 90.0 -nt 1
    --input /data.chr1.genotypedGVCF.g.vcf.gz --input /data.chr2.genotypedGVCF.g.vcf.gz --input /data.chr3.genotypedGVCF.g.vcf.gz --input /data.chr4.genotypedGVCF.g.vcf.gz --input /data.chr5.genotypedGVCF.g.vcf.gz --input /data.chr6.genotypedGVCF.g.vcf.gz --input /data.chr7.genotypedGVCF.g.vcf.gz --input /data.chr8.genotypedGVCF.g.vcf.gz --input /data.chr9.genotypedGVCF.g.vcf.gz --input /data.chr10.genotypedGVCF.g.vcf.gz --input /data.chr11.genotypedGVCF.g.vcf.gz --input /data.chr12.genotypedGVCF.g.vcf.gz --input /data.chr13.genotypedGVCF.g.vcf.gz --input /data.chr14.genotypedGVCF.g.vcf.gz --input /data.chr15.genotypedGVCF.g.vcf.gz --input /data.chr16.genotypedGVCF.g.vcf.gz --input /data.chr17.genotypedGVCF.g.vcf.gz --input /data.chr18.genotypedGVCF.g.vcf.gz --input /data.chr19.genotypedGVCF.g.vcf.gz --input /data.chr20.genotypedGVCF.g.vcf.gz --input /data.chr21.genotypedGVCF.g.vcf.gz --input /data.chr22.genotypedGVCF.g.vcf.gz -recalFile /data.allsamples.recalibrate_SNP.tranches -tranchesFile /data.allsamples.recalibrate_SNP.tranches -rscriptFile /data.allsamples.recalibrate_SNP_plots.R


  • SheilaSheila Broad InstituteMember, Broadie admin


    It looks like you are still using -L chr20. If you remove that, the error should go away.


  • yyeeyyee Member

    Hi, we have removed -L chr20. The error is still the same. Thanks.

    Issue · Github
    by Sheila

    Issue Number
    Last Updated
    Closed By
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there, sorry for the late response @yyee. Can you post the log output you get when running on the whole genome instead of just chr20?

    Assuming there's nothing unexpected there, it looks like the recalibrator is not finding variants that satisfy the standard "bad model" requirement of LOD values under -5.0. One way to work around this is to increase the value of the LOD cutoff; another is to specify a number for minNumBad to force the program to take whatever N variants have the lowest LOD values.

    This shouldn't be necessary but maybe there's something about your callset that doesn't follow our expectations. Were your variants all joint-called with HaplotypeCaller+GenotypeGVCFs?

  • I was having this same error with my WES data, but by setting -max-gaussians to 4, it fixed the issue.

Sign In or Register to comment.