VariantRecalibrator: question about raising the number of variants used to train the negative model

I'm having an issue with VariantRecalibrator that I've faced in the past.

This is the error, which I've seen before:
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example).

I understand it's because the total number of variants is too low, but I can't see how that could be an issue for my data set.

wc *_genotyped.vcf = 350678 93598833 1851184230
The file contains 200 background individuals plus 3 target exomes.

And my command:
java -jar -T VariantRecalibrator -R /data/srynearson/gatk_reference/human_g1k_v37_decoy.fasta --minNumBadVariants 5000 --num_threads 30 -resource:mills,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf -resource:1000G,known=false,training=true,truth=true,prior=10.0 /data/GATK_Bundle/1000G_phase1.indels.b37.vcf -an MQRankSum -an ReadPosRankSum -an FS -input /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_genotyped.vcf -recalFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_recal -tranchesFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_tranches -rscriptFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_plots.R -mode INDEL

And my version:
The Genome Analysis Toolkit (GATK) vnightly-2014-06-17-gc231c21, Compiled 2014/06/17 00:01:17

So you see my minNumBadVariants is already set to 5000.

Is the error due to the total number of indels in the VCF file? Otherwise the "too few variants" error seems wrong.

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Possibly -- can you give the count of indels in the dataset?

  • srynearson1 (Member)
    edited July 2014

    23458 23458 162095 indel.list
    327096 327096 654192 snp.list

    Also the output states:
    INFO 09:49:57,875 VariantDataManager - Training with 14997 variants after standard deviation thresholding.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hmm, that should be enough. Can you post the output you get if you run your command with -l DEBUG (lowercase L, not capital I)?
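
For anyone following along who wants to check the indel count in their own VCF, here is a minimal Python sketch. It is not how the poster produced indel.list and snp.list (the thread does not say), and it does not use GATK's own variant typing: it simply compares REF and ALT allele lengths, so mixed or symbolic sites may be classified differently than GATK would. The function names and the example file name are hypothetical.

```python
import gzip
from collections import Counter


def classify_record(ref, alt):
    """Roughly classify a VCF record as 'snp', 'indel', or 'other'.

    Assumption: length comparison only; this is not GATK's typing logic.
    """
    alleles = [a for a in alt.split(",") if a not in (".", "*")]
    if not alleles:
        return "other"  # no usable ALT allele
    # Symbolic alleles (<DEL>, <NON_REF>) and breakends are not counted here.
    if any(a.startswith("<") or "[" in a or "]" in a for a in alleles):
        return "other"
    if any(len(a) != len(ref) for a in alleles):
        return "indel"  # at least one ALT changes the length
    if len(ref) == 1:
        return "snp"  # single base in REF and in every ALT
    return "other"  # same-length multi-base change (MNP)


def count_variant_types(lines):
    """Count record types from an iterable of VCF text lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        counts[classify_record(fields[3], fields[4])] += 1
    return counts


def count_vcf(path):
    """Count record types in a plain or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_variant_types(handle)
```

Usage would look something like `count_vcf("my_genotyped.vcf")`, where the file name is a placeholder; the returned counter has one entry per record type. Comparing the indel count against the "Training with N variants" line in the VariantRecalibrator log gives a quick sanity check before digging into DEBUG output.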
