To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

VQSR --maxGaussians paramater

Hi,

I am performing VQSR (GATK 3.7) using the SNP model on individual chromosomes on hundreds of WGS data. However, for some chromosomes, it ran without a problem using the --maxGaussians default, which is 8. We followed your suggestion to lower that number to 4, it still doesn't work. We tried 3, still doesn't work. Finally it works when --maxGaussians is set to 2.

Apparently this --maxGaussians parameter has nothing to do with the chromosome size. My questions are
1) if --maxGaussians = 2 scientifically sound?
2) would you suggest using a uniform --maxGaussians parameter across all the chromosomes, i.e. 2 in this case, or I should just tune the parameter per chromosome?

Can you provide some guidance on this?

Thanks!

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    If you read the docs on VQSR and watch the workshop talks you'll find that parameter refers to the max number of variant clusters the modeling process should attempt to identify.

    Your problem here is that you seem to be running VQSR per-chromosome, which is a bad idea for several reasons -- mainly, it underpowers the model by limiting the amount of data points available per run, and leads to different filtering rules per chromosome, which is not what you want.

    So you should simply run VariantRecalibrator on all the data at once.

  • yyeeyyee Member

    Thanks for the suggestion.

    I have tried running the VariantRecalibrator using all the chromosomes, however, I am still getting the same error.

    INFO 12:28:44,478 VariantRecalibratorEngine - Convergence after 80 iterations!
    INFO 12:28:45,877 VariantRecalibratorEngine - Evaluating full set of 847100 variants...
    INFO 12:28:45,921 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

    ERROR --
    ERROR stack trace

    java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:489)
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:185)
    at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:115)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @yyee
    Hi,

    Can you please post the exact command you ran? Are you running on the VCF containing hundred of WGS samples?

    Thanks,
    Sheila

  • yyeeyyee Member

    Hi ,

    We are running on 1 VCF that contains hundreds of WGS samples:

    Command:

    Program Args: -T VariantRecalibrator -R /mnt/data/NGS/ref/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /mnt/data/NGS/ref/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /mnt/data/NGS/ref/hg38/hapmap_3.3.hg38.vcf.gz -resource:omni,known=false,training=true,truth=true,prior=12.0 /mnt/data/NGS/ref/hg38/1000G_omni2.5.hg38.vcf.gz -resource:1000G,known=false,training=true,truth=false,prior=10.0 /mnt/data/NGS/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz -an QD -an FS -an DP -an SOR -an MQ -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff -mode SNP -allPoly -L chr20 -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.7 -tranche 99.5 -tranche 99.3 -tranche 99.0 -tranche 98.5 -tranche 98.0 -tranche 97.0 -tranche 95.0 -tranche 90.0 -nt 1
    --input /data.chr1.genotypedGVCF.g.vcf.gz --input /data.chr2.genotypedGVCF.g.vcf.gz --input /data.chr3.genotypedGVCF.g.vcf.gz --input /data.chr4.genotypedGVCF.g.vcf.gz --input /data.chr5.genotypedGVCF.g.vcf.gz --input /data.chr6.genotypedGVCF.g.vcf.gz --input /data.chr7.genotypedGVCF.g.vcf.gz --input /data.chr8.genotypedGVCF.g.vcf.gz --input /data.chr9.genotypedGVCF.g.vcf.gz --input /data.chr10.genotypedGVCF.g.vcf.gz --input /data.chr11.genotypedGVCF.g.vcf.gz --input /data.chr12.genotypedGVCF.g.vcf.gz --input /data.chr13.genotypedGVCF.g.vcf.gz --input /data.chr14.genotypedGVCF.g.vcf.gz --input /data.chr15.genotypedGVCF.g.vcf.gz --input /data.chr16.genotypedGVCF.g.vcf.gz --input /data.chr17.genotypedGVCF.g.vcf.gz --input /data.chr18.genotypedGVCF.g.vcf.gz --input /data.chr19.genotypedGVCF.g.vcf.gz --input /data.chr20.genotypedGVCF.g.vcf.gz --input /data.chr21.genotypedGVCF.g.vcf.gz --input /data.chr22.genotypedGVCF.g.vcf.gz -recalFile /data.allsamples.recalibrate_SNP.tranches -tranchesFile /data.allsamples.recalibrate_SNP.tranches -rscriptFile /data.allsamples.recalibrate_SNP_plots.R

    Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @yyee
    Hi,

    It looks like you are still using -L chr20. If you remove that, the error should go away.

    -Sheila

  • yyeeyyee Member

    Hi, we have removed -L chr20. The error is still the same. Thanks.

    Issue · Github
    by Sheila

    Issue Number
    1827
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi there, sorry for the late response @yyee. Can you post the log output you get when running on the whole genome instead of just chr20?

    Assuming there's nothing unexpected there, it looks like the recalibrator is not finding variants that satisfy the standard "bad model" requirement of LOD values under -5.0. One way to work around this is to increase the value of the LOD cutoff; another is to specify a number for minNumBad to force the program to take whatever N variants have the lowest LOD values.

    This shouldn't be necessary but maybe there's something about your callset that doesn't follow our expectations. Were your variants all joint-called with HaplotypeCaller+GenotypeGVCFs?

Sign In or Register to comment.