VariantRecalibrator resource: known, training and truth confusion

sluban (Israel, Member)
We are running VariantRecalibrator on a raw VCF file of mouse data with the following command:

gatk --java-options "-Xmx4g" VariantRecalibrator -R Mus_musculus.GRCm38.dna.primary_assembly_ordered.fa -V allSamples_bwa_genotyped.vcf -resource:VCF,known=true,training=false,truth=false,prior=2.0 /mouse/mm10/Ensembl/mus_musculus.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode BOTH -O allSamples_bwa_genotyped.recal --tranches-file allSamples_bwa_genotyped.tranches 2> allSamples_bwa_genotyped_recal.log &

This produces the following error:

A USER ERROR has occurred: No training set found! Please provide sets of known polymorphic loci marked with the training=true feature input tag. For example, -resource hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmapFile.vcf
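Going by the error message, the command needs at least one resource tagged training=true. The following is only a sketch of how a second resource might be added next to the existing one. The file name training_snps.vcf and the resource label strains are hypothetical placeholders, and prior=12.0 is copied from the hapmap example in the error message, not a value known to be appropriate for mouse data. The script prints the command instead of running it, and includes a small check that mirrors the "No training set found" condition.

```shell
#!/usr/bin/env bash
# Sketch under stated assumptions: TRAINING_VCF and the label "strains" are placeholders.

KNOWN_VCF="/mouse/mm10/Ensembl/mus_musculus.vcf"  # known-only resource from the original command
TRAINING_VCF="training_snps.vcf"                  # hypothetical training/truth file

# Keep the arguments in an array so each resource tag string stays one argument.
args=(
  VariantRecalibrator
  -R Mus_musculus.GRCm38.dna.primary_assembly_ordered.fa
  -V allSamples_bwa_genotyped.vcf
  -resource:VCF,known=true,training=false,truth=false,prior=2.0 "$KNOWN_VCF"
  -resource:strains,known=false,training=true,truth=true,prior=12.0 "$TRAINING_VCF"
  -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff
  -mode BOTH
  -O allSamples_bwa_genotyped.recal
  --tranches-file allSamples_bwa_genotyped.tranches
)

# Succeeds only if at least one argument carries the training=true tag.
has_training_set() {
  local a
  for a in "${args[@]}"; do
    case "$a" in
      *training=true*) return 0 ;;
    esac
  done
  return 1
}

# Print the full command line (dry run) rather than executing gatk.
print_cmd() {
  echo gatk --java-options "-Xmx4g" "${args[@]}"
}

if has_training_set; then
  print_cmd
else
  echo "no resource is tagged training=true" >&2
fi
```

Once real file names are filled in, the printed line can be pasted into a terminal. Which files are suitable as training and truth sets for mouse data is still the open question here.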

We are not sure what to use for our known, training and truth datasets. Currently, for our known set (known=true,training=false,truth=false) we are using /mouse/mm10/Ensembl/mus_musculus.vcf, which we downloaded from:


For our training set (known=false,training=true,truth=true) we want to use the following (a merge of different mouse strains):


However, we believe this file contains only SNPs, so we would need to download another large file for indels.

Should we instead use the first (Ensembl) file for training as well (known=true,training=true,truth=true)?

How many different files do we need to specify with the -resource parameter, and what should the known=?,training=?,truth=?,prior=? values be for each of them?

Are we using the right files? If not, could you please suggest where we can get the right ones?

