

Indels Recalibration error message

I am trying to recalibrate the indel calls in my VCF file using the command below:

java -Xmx2G -jar ../GenomeAnalysisTK.jar -T VariantRecalibrator \
-R ../GATK_ref/hg19.fasta \
-input ./Variants/gcat_set_053_2.raw.snps.indels.vcf \
-nt 4 \
-resource:mills,known=false,training=true,truth=true,prior=12.0 ../GATK_ref/Mills_and_1000G_gold_standard.indels.hg19.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../GATK_ref/dbsnp_137.hg19.vcf \
-an DP -an FS -an ReadPosRankSum -an MQRankSum \
--maxGaussians 4 \
-percentBad 0.05 \
-minNumBad 1000 \
-mode INDEL \
-recalFile ./Variants/VQSR/gcat_set_053_2.indels.vcf.recal \
-tranchesFile ./Variants/VQSR/gcat_set_053_2.indels.tranches \
-rscriptFile ./Variants/VQSR/gcat_set_053_2.indels.recal.plots.R > ./Variants/VQSR/IndelRecal2-noAnnot.log

I got the error message below even after following its recommendations (--maxGaussians 4 and -percentBad 0.05). What does this error message mean? Does my file have too few variants? It's exome-seq data.

##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Your dataset may simply be too small to use VQSR. How many samples are you analyzing?
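
To answer the "how many samples" question and get a feel for how many indel records are available, something like the following could be run on the raw VCF. This is only a rough sketch: the function name `vcf_indel_summary` is made up for illustration, it assumes an uncompressed VCF, and it counts a record as an indel whenever REF and at least one ALT allele differ in length, which is a simplification and not necessarily how GATK classifies variants.

```shell
# Sketch (hypothetical helper, not part of GATK): summarize a VCF before trying VQSR.
# Assumes an uncompressed VCF with tab-separated columns.
vcf_indel_summary() {
    local vcf="$1"
    awk -F'\t' '
        /^##/ { next }                      # skip meta-information lines
        /^#CHROM/ {                         # column header: samples start at column 10
            samples = (NF > 9) ? NF - 9 : 0
            next
        }
        {
            records++
            # Simplified indel test: REF and some ALT allele differ in length.
            # Symbolic (<...>), spanning-deletion (*) and missing (.) alleles are ignored.
            n = split($5, alts, ",")
            for (i = 1; i <= n; i++) {
                if (alts[i] != "." && alts[i] !~ /^[<*]/ && length(alts[i]) != length($4)) {
                    indels++
                    break
                }
            }
        }
        END { printf "samples=%d records=%d indels=%d\n", samples, records, indels }
    ' "$vcf"
}
```

For example, `vcf_indel_summary ./Variants/gcat_set_053_2.raw.snps.indels.vcf` would print one line with the sample, record and indel counts for the input file used in the question.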

  • Oprah (Member)

    I have more or less the same problem with 88 exomes, using VariantRecalibrator v3.1-1 in INDEL mode:

    INFO ... VariantDataManager - Training with 5808 variants after standard deviation thresholding

    WARN ... VariantDataManager - WARNING: Training with very few variant sites!

    INFO ... VariantRecalibratorEngine - Evaluating full set of 18731 variants ...

    INFO ... VariantDataManager - Training with worst 312 scoring variants --> variants with LOD <= -5.000

    ERROR MESSAGE: NaN LOD value assigned ... consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example)

    I inserted --minNumBadVariants 5000 into my command line, then tried 6000, then tried 7000; the training numbers (5808 and 312 seen above) changed only slightly, and (not surprisingly) I keep getting that error message. If I have to resort to hard-filtering, where can I find the parameters to use? Thanks.
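
When comparing runs with different settings, as in the post above, it can help to pull just the two training counts out of each log. The following is a minimal sketch that assumes the log lines look like the excerpts quoted in this thread. The function name `vqsr_training_counts` and the output labels are invented here, and labelling the "worst ... scoring variants" count as the negative-model training set is an interpretation based on the error message, not something the log states.

```shell
# Sketch (hypothetical helper): extract the training-set sizes from a
# VariantRecalibrator log, assuming the message wording quoted in this thread.
vqsr_training_counts() {
    local log="$1"
    # With -n, sed prints only the lines where a substitution succeeded.
    sed -n \
        -e 's/.*Training with \([0-9][0-9]*\) variants after standard deviation thresholding.*/positive_training=\1/p' \
        -e 's/.*Training with worst \([0-9][0-9]*\) scoring variants.*/negative_training=\1/p' \
        "$log"
}
```

Running it on each log, for example `vqsr_training_counts ./Variants/VQSR/IndelRecal2-noAnnot.log`, makes it easy to see whether a changed argument moved either number at all.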

  • Oprah (Member)

    Thanks, I should've found it on my own.

    Anyway, because -minNumBadVariants wasn't doing anything, I dropped it from the command line, and tried -mNG 4 (btw I was already using --maxGaussians 4). I got no error messages! No error messages either with -mNG 3 (the default is 2). Are the results safe to use? If so, is mNG 3 "better" than 4 because it's closer to the default value of 2? Or maybe it doesn't matter when training with only 312 bad variants?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Ah that's interesting. I would recommend examining the recalibration plots -- if they look reasonable, then the results are probably safe to use. Same approach for choosing which -mNG value is better -- look at which one gives the most reasonable-looking plots.

  • Oprah (Member)

    The tranches plot isn't generated for indels, correct?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    That's correct, but it's not the tranche plots you want; it's the recal plots, which show the clouds of variants plotted along different dimensions.
