VariantRecalibrator resource: known, training and truth confusion

sluban (Israel, Member)
We are running VariantRecalibrator on a raw VCF file from mouse data with the following command:

gatk --java-options "-Xmx4g" VariantRecalibrator -R Mus_musculus.GRCm38.dna.primary_assembly_ordered.fa -V allSamples_bwa_genotyped.vcf -resource:VCF,known=true,training=false,truth=false,prior=2.0 /mouse/mm10/Ensembl/mus_musculus.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode BOTH -O allSamples_bwa_genotyped.recal --tranches-file allSamples_bwa_genotyped.tranches 2> allSamples_bwa_genotyped_recal.log &

This produces the following:

A USER ERROR has occurred: No training set found! Please provide sets of known polymorphic loci marked with the training=true feature input tag. For example, -resource hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmapFile.vcf

We are not sure what to use for our known, training and truth datasets. Currently, for our known (known=true,training=false,truth=false) we are using /mouse/mm10/Ensembl/mus_musculus.vcf which we downloaded from:


For our training set (known=false,training=true,truth=true) we want to use the following file (a merge of different mouse strains):


However, we believe this file contains only SNPs, so we would need to download another large file for indels.
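
To make the question concrete, here is a rough sketch of how we assume a second resource flagged training=true would be added to the command above, going by the example in the error message. The labels `ensembl` and `strains` and the file name `merged_mouse_strains.vcf` are placeholders of ours (the real file is the merged strains download mentioned above), and prior=12.0 is simply copied from the error message's example, not a recommendation. We are also assuming the `-resource:label,key=value,...` form in which the first tag is just a label. The script only builds and prints the command; it does not run GATK.

```shell
#!/bin/sh
# Sketch only: builds and prints the command, it does not run GATK.

# Known-sites file from the original command.
KNOWN_VCF="/mouse/mm10/Ensembl/mus_musculus.vcf"
# Hypothetical placeholder name for the merged mouse strains file.
TRAINING_VCF="merged_mouse_strains.vcf"

# "ensembl" and "strains" are arbitrary labels chosen for this sketch.
KNOWN_RESOURCE="-resource:ensembl,known=true,training=false,truth=false,prior=2.0 $KNOWN_VCF"
# prior=12.0 is copied from the example in the error message, not a tuned value.
TRAINING_RESOURCE="-resource:strains,known=false,training=true,truth=true,prior=12.0 $TRAINING_VCF"

# Everything else is unchanged from the original command.
CMD="gatk --java-options '-Xmx4g' VariantRecalibrator \
-R Mus_musculus.GRCm38.dna.primary_assembly_ordered.fa \
-V allSamples_bwa_genotyped.vcf \
$KNOWN_RESOURCE \
$TRAINING_RESOURCE \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
-mode BOTH \
-O allSamples_bwa_genotyped.recal \
--tranches-file allSamples_bwa_genotyped.tranches"

echo "$CMD"
```

Whether these two files are suitable for the known and training/truth roles, and what priors they should get, is exactly what we are unsure about.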

Should we use the first (Ensembl) file as training (known=true,training=true,truth=true)?

How many different files do we need to specify for the -resource parameter, and what should be our known=?,training=?,truth=?,prior=? for them?

Are we using the right files? If not, could you please suggest where we can get them?


