
Indels Recalibration error message

rcholic Denver Member

I am trying to recalibrate the indel calls in my VCF file using the command line below:

java -Xmx2G -jar ../GenomeAnalysisTK.jar -T VariantRecalibrator \
-R ../GATK_ref/hg19.fasta \
-input ./Variants/gcat_set_053_2.raw.snps.indels.vcf \
-nt 4 \
-resource:mills,known=false,training=true,truth=true,prior=12.0 ../GATK_ref/Mills_and_1000G_gold_standard.indels.hg19.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../GATK_ref/dbsnp_137.hg19.vcf \
-an DP -an FS -an ReadPosRankSum -an MQRankSum \
--maxGaussians 4 \
-percentBad 0.05 \
-minNumBad 1000 \
-mode INDEL \
-recalFile ./Variants/VQSR/gcat_set_053_2.indels.vcf.recal \
-tranchesFile ./Variants/VQSR/gcat_set_053_2.indels.tranches \
-rscriptFile ./Variants/VQSR/gcat_set_053_2.indels.recal.plots.R > ./Variants/VQSR/IndelRecal2-noAnnot.log

I got the error message below even after following its recommendations (--maxGaussians 4, -percentBad 0.05). What does this error message mean? Does my file have too few variants? It's exome-seq data.

##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
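Editor's note: since the error points at too few variants, one quick check is to count how many indel records the input VCF contains. The sketch below is a rough illustration, not part of GATK. The function names and the in-memory example records are hypothetical. It treats a record as an indel when any ALT allele differs in length from REF, and it skips symbolic, breakend, missing and spanning-deletion alleles.

```python
def is_indel_allele(ref, alt):
    """Return True if alt is a plain sequence allele whose length differs from ref."""
    if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
        return False  # missing, spanning-deletion, symbolic or breakend allele
    return len(alt) != len(ref)


def count_indels(vcf_lines):
    """Count VCF records that carry at least one indel ALT allele."""
    total = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        if any(is_indel_allele(ref, alt) for alt in alts):
            total += 1
    return total


# Hypothetical records, only to show the call; a real run would pass an open file.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\t.",      # SNP
    "chr1\t200\t.\tAT\tA\t50\t.\t.",     # deletion
    "chr1\t300\t.\tC\tCAG,T\t50\t.\t.",  # insertion plus SNP at a multi-allelic site
]
indel_count = count_indels(example)
print(indel_count)
```

With a real file, passing the open file handle to count_indels works the same way, because the function only iterates over lines.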

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Your dataset may simply be too small to use VQSR. How many samples are you analyzing?

  • Oprah Member

    I have more or less the same problem: 88 exomes, using VariantRecalibrator v3.1-1 in INDEL mode.

    INFO ... VariantDataManager - Training with 5808 variants after standard deviation thresholding

    WARN ... VariantDataManager - WARNING: Training with very few variant sites!

    INFO ... VariantRecalibratorEngine - Evaluating full set of 18731 variants ...

    INFO ... VariantDataManager - Training with worst 312 scoring variants --> variants with LOD <= -5.000

    ERROR MESSAGE: NaN LOD value assigned ... consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example)

    I inserted --minNumBadVariants 5000 into my command line, then tried 6000, then tried 7000; the training numbers (5808 and 312 seen above) changed only slightly, and (not surprisingly) I keep getting that error message. If I have to resort to hard-filtering, where can I find the parameters to use? Thanks.
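Editor's note: when comparing runs with different settings, it can help to pull the two training counts out of each log rather than reading them by eye. The sketch below assumes the log lines look like the ones quoted in this post. The regular expressions and the function name are illustrative, and other GATK versions may word these lines differently.

```python
import re

# Patterns based on the log lines quoted in the post above.
POSITIVE = re.compile(r"Training with (\d+) variants after standard deviation thresholding")
NEGATIVE = re.compile(r"Training with worst (\d+) scoring variants")


def training_counts(log_text):
    """Return (positive, negative) training-set sizes, or None where a line is absent."""
    pos = POSITIVE.search(log_text)
    neg = NEGATIVE.search(log_text)
    return (
        int(pos.group(1)) if pos else None,
        int(neg.group(1)) if neg else None,
    )


log = """\
INFO ... VariantDataManager - Training with 5808 variants after standard deviation thresholding
WARN ... VariantDataManager - WARNING: Training with very few variant sites!
INFO ... VariantRecalibratorEngine - Evaluating full set of 18731 variants ...
INFO ... VariantDataManager - Training with worst 312 scoring variants --> variants with LOD <= -5.000
"""
positive, negative = training_counts(log)
print(positive, negative)
```

If the counts barely move between runs, as described here, the changed argument is probably not affecting the model that fails.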

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
  • Oprah Member

    Thanks, I should've found it on my own.

    Anyway, because --minNumBadVariants wasn't doing anything, I dropped it from the command line and tried -mNG 4 (btw, I was already using --maxGaussians 4). I got no error messages! No error messages with -mNG 3 either (the default is 2). Are the results safe to use? If so, is -mNG 3 "better" than 4 because it's closer to the default value of 2? Or maybe it doesn't matter when training with only 312 bad variants?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Ah that's interesting. I would recommend examining the recalibration plots -- if they look reasonable, then the results are probably safe to use. Same approach for choosing which -mNG value is better -- look at which one gives the most reasonable-looking plots.
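Editor's note: to compare plots across -mNG values, each run needs its own output files so that one run does not overwrite another's plots. The sketch below only builds the argument lists and does not run GATK. The helper name, the output prefix and the abbreviated base command are hypothetical. The flags themselves are the ones already used in this thread.

```python
def recal_command(base_args, max_negative_gaussians, out_prefix):
    """Return a new argument list with -mNG set and per-run output paths appended."""
    tag = f"{out_prefix}.mNG{max_negative_gaussians}"
    return base_args + [
        "-mNG", str(max_negative_gaussians),
        "-recalFile", tag + ".recal",
        "-tranchesFile", tag + ".tranches",
        "-rscriptFile", tag + ".plots.R",
    ]


# Abbreviated base command: a real run also needs -R, -input, -resource and -an arguments.
base = [
    "java", "-jar", "GenomeAnalysisTK.jar",
    "-T", "VariantRecalibrator",
    "-mode", "INDEL",
    "--maxGaussians", "4",
]

commands = {n: recal_command(base, n, "indels") for n in (3, 4)}
for n, cmd in commands.items():
    print(n, " ".join(cmd))
```

Each resulting list could be handed to subprocess.run once the missing arguments are filled in, and the per-run plots can then be opened side by side.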

  • Oprah Member

    The tranches plot isn't generated for indels, correct?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    That's correct, but it's not the tranches plot you want; it's the recal plots, which show the clouds of variants plotted along different dimensions.
