VariantRecalibrator: question about raising the number of variants used to train the negative model
I'm having an issue with VariantRecalibrator that I've run into in the past.
This is the error, which I've seen before:
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example).
I understand this happens because the total number of variants is too low, but I can't see how that could be an issue for my data set.
wc *_genotyped.vcf = 350678 93598833 1851184230
The VCF contains 200 background individuals + 3 target exomes.
And my command:
java -jar -T VariantRecalibrator \
  -R /data/srynearson/gatk_reference/human_g1k_v37_decoy.fasta \
  --minNumBadVariants 5000 \
  --num_threads 30 \
  -resource:mills,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf \
  -resource:1000G,known=false,training=true,truth=true,prior=10.0 /data/GATK_Bundle/1000G_phase1.indels.b37.vcf \
  -an MQRankSum -an ReadPosRankSum -an FS \
  -input /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_genotyped.vcf \
  -recalFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_recal \
  -tranchesFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_tranches \
  -rscriptFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_plots.R \
  -mode INDEL
And my version:
The Genome Analysis Toolkit (GATK) vnightly-2014-06-17-gc231c21, Compiled 2014/06/17 00:01:17
So you see my minNumBadVariants is already set to 5000.
Is the error due to the total number of indels in the VCF file? Otherwise the "too few variants" error seems wrong.
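One way I could check this: the wc figure above counts every line in the file, including header lines and SNP records, so it says nothing about how many indels are available in INDEL mode. Below is a rough sketch for counting only indel-like records. The function name count_indels is my own (it is not part of GATK), and it assumes that a record counts as an indel whenever at least one ALT allele differs in length from REF, skipping symbolic alleles. GATK's own variant typing may classify some records differently.

```shell
# count_indels: hypothetical helper, not a GATK tool.
# Prints the number of VCF records in which at least one ALT allele
# has a different length than REF (a simple heuristic for indels).
# Usage: count_indels /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_genotyped.vcf
count_indels() {
    awk -F'\t' '
        /^#/ { next }                        # skip meta and header lines
        {
            n = split($5, alts, ",")         # ALT may list several alleles
            for (i = 1; i <= n; i++) {
                # ignore missing, spanning-deletion and symbolic alleles
                if (alts[i] == "." || alts[i] == "*" || alts[i] ~ /^</) continue
                if (length(alts[i]) != length($4)) { count++; break }
            }
        }
        END { print count + 0 }              # print 0 if nothing matched
    ' "$1"
}
```

If the number this reports is small compared with the 350678 lines in the file, that would at least be consistent with the error coming from the indel count rather than the overall size of the call set.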