
Indels Recalibration error message

rcholic (Denver) Posts: 63, Member

I am trying to recalibrate the indel calls in my VCF file using the command below:

java -Xmx2G -jar ../GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R ../GATK_ref/hg19.fasta \
    -input ./Variants/gcat_set_053_2.raw.snps.indels.vcf \
    -nt 4 \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 ../GATK_ref/Mills_and_1000G_gold_standard.indels.hg19.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../GATK_ref/dbsnp_137.hg19.vcf \
    -an DP -an FS -an ReadPosRankSum -an MQRankSum \
    --maxGaussians 4 \
    -percentBad 0.05 \
    -minNumBad 1000 \
    -mode INDEL \
    -recalFile ./Variants/VQSR/gcat_set_053_2.indels.vcf.recal \
    -tranchesFile ./Variants/VQSR/gcat_set_053_2.indels.tranches \
    -rscriptFile ./Variants/VQSR/gcat_set_053_2.indels.recal.plots.R \
    > ./Variants/VQSR/IndelRecal2-noAnnot.log

I got the error message below even after following the recommendations (e.g. --maxGaussians 4, --percentBad 0.05). What does this error message mean? Do my files have too few variants? The data is exome-seq.

##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
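
Whether a call set has "too few variants" is easier to judge with actual counts in hand. The following is a minimal sketch, assuming an uncompressed VCF and treating any record whose REF and ALT alleles differ in length as an indel. The function names count_samples and count_indels are made up for illustration and are not part of GATK.

```shell
#!/bin/sh
# Count samples and indel records in an uncompressed VCF.
# Assumption: a record is an "indel" if REF and at least one ALT allele differ
# in length. Symbolic alleles such as <DEL> get no special treatment.

count_samples() {
    # Sample names are the columns after FORMAT (column 9) on the #CHROM line.
    awk -F'\t' '/^#CHROM/ { n = NF - 9; if (n < 0) n = 0; print n; exit }' "$1"
}

count_indels() {
    awk -F'\t' '!/^#/ {
        n = split($5, alts, ",")          # ALT may list several alleles
        for (i = 1; i <= n; i++) {
            if (length(alts[i]) != length($4)) { c++; break }
        }
    } END { print c + 0 }' "$1"
}
```

For example, count_indels ./Variants/gcat_set_053_2.raw.snps.indels.vcf prints the number of indel records in the input file from the command above.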

Answers

  • Geraldine_VdAuwera Posts: 5,210, Administrator, GSA Member

    Your dataset may simply be too small to use VQSR. How many samples are you analyzing?

    Geraldine Van der Auwera, PhD

  • Oprah Posts: 21, Member

    I have more or less the same problem: 88 exomes, using VariantRecalibrator v3.1-1 in INDEL mode.

    INFO ... VariantDataManager - Training with 5808 variants after standard deviation thresholding
    WARN ... VariantDataManager - WARNING: Training with very few variant sites!
    INFO ... VariantRecalibratorEngine - Evaluating full set of 18731 variants ...
    INFO ... VariantDataManager - Training with worst 312 scoring variants --> variants with LOD <= -5.000
    ERROR MESSAGE: NaN LOD value assigned ... consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example)

    I inserted --minNumBadVariants 5000 into my command line, then tried 6000, then tried 7000; the training numbers (5808 and 312 seen above) changed only slightly, and (not surprisingly) I keep getting that error message. If I have to resort to hard-filtering, where can I find the parameters to use? Thanks.

  • Geraldine_VdAuwera Posts: 5,210, Administrator, GSA Member
  • Oprah Posts: 21, Member

    Thanks, I should've found it on my own.
    Anyway, because --minNumBadVariants wasn't doing anything, I dropped it from the command line and tried -mNG 4 (btw, I was already using --maxGaussians 4). I got no error messages! No error messages with -mNG 3 either (the default is 2). Are the results safe to use? If so, is -mNG 3 "better" than 4 because it's closer to the default value of 2? Or maybe it doesn't matter when training with only 312 bad variants?

  • Geraldine_VdAuwera Posts: 5,210, Administrator, GSA Member

    Ah that's interesting. I would recommend examining the recalibration plots -- if they look reasonable, then the results are probably safe to use. Same approach for choosing which -mNG value is better -- look at which one gives the most reasonable-looking plots.

    Geraldine Van der Auwera, PhD

  • Oprah Posts: 21, Member

    The tranches plot isn't generated for indels, correct?

  • Geraldine_VdAuwera Posts: 5,210, Administrator, GSA Member

    That's correct, but it's not the tranches plot you want; it's the recalibration plots, which show the clouds of variants plotted along different dimensions.

    Geraldine Van der Auwera, PhD
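
The suggestion in this thread to compare recalibration plots across -mNG values could be set up roughly as follows. This is a sketch only: it prints one VariantRecalibrator command per -mNG value without running anything, reusing the flags, resources and annotations from the command in the original question. The helper name print_mng_sweep and the <prefix>.mNG<n>.* output naming are assumptions made for illustration.

```shell
#!/bin/sh
# Dry run: print one VariantRecalibrator command per -mNG value; nothing is executed.
# Flags, resources and annotations are copied from the command in the question.
# Assumptions: the helper name and the <prefix>.mNG<n>.* output naming are
# hypothetical, and paths are assumed to contain no spaces.

print_mng_sweep() {
    ref=$1
    input=$2
    prefix=$3
    shift 3
    for mng in "$@"; do
        stem="${prefix}.mNG${mng}"   # separate outputs per run so plots can be compared
        echo java -Xmx2G -jar ../GenomeAnalysisTK.jar -T VariantRecalibrator \
            -R "$ref" -input "$input" -mode INDEL \
            --maxGaussians 4 -mNG "$mng" \
            -resource:mills,known=false,training=true,truth=true,prior=12.0 \
            ../GATK_ref/Mills_and_1000G_gold_standard.indels.hg19.vcf \
            -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 \
            ../GATK_ref/dbsnp_137.hg19.vcf \
            -an DP -an FS -an ReadPosRankSum -an MQRankSum \
            -recalFile "${stem}.recal" \
            -tranchesFile "${stem}.tranches" \
            -rscriptFile "${stem}.plots.R"
    done
}
```

Calling print_mng_sweep ../GATK_ref/hg19.fasta input.vcf ./Variants/VQSR/indels 2 3 4 prints three commands (input.vcf is a placeholder). Because each run writes to its own output prefix, the recalibration plots from the different -mNG settings can be compared side by side after the commands are reviewed and run.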
