variant recalibration and training-truth set annotation

I have a quick question about your GATK Best Practices paper (Curr Protoc Bioinformatics). Forgive me, as I am quite new to SNP analysis, but I have tried to implement the pipeline on the main Galaxy server, since I am more comfortable using Galaxy.

I had no problems until I reached the Variant Recalibrator stage. I can see from the error message that my Hapmap_3.3.hg19.vcf, the Omni one and the 1000g one (all downloaded from the GATK bundle, hg19) do not have the required annotations (DP, FS, QD, MQRankSum and ReadPosRankSum) to fulfill the instructions as per the paper.

I have read all the relevant GATK sections on your website as well, but I am unclear how to use these training sets, since they lack the required annotations. My understanding is that I would need the accompanying BAM files to fully annotate these files? I did attempt to use the Annotator function (without the BAM) on the Hapmap VCF, but I got a message saying that I didn't have enough memory. (I have not encountered such an error with Galaxy before, although I don't have much Galaxy experience.)

Can you suggest where I might find such annotated training sets? (hg19)
Or, how to annotate the ones from the bundle?


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi there,

    The resource files don't need to have those annotations; the files from the bundle are sufficient. When you run VariantRecalibrator, the program picks out the variants in your own dataset (which are fully annotated, or should be) that are also present in the resource files. It reads the annotation values from your variants at those shared sites, and that's how it gets the annotations to build the model.
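
To make the idea above concrete, here is a toy Python sketch of the lookup. It is not GATK code: the function names, the sample records, and the choice to match sites on chromosome and position only are all assumptions made for illustration, and GATK's actual matching rules may differ. The point it shows is that the resource file supplies only the list of trusted sites, while the annotation values (DP, QD, FS) come from your own call set.

```python
# Toy illustration only, not GATK code. All names and records are hypothetical.

def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=40;QD=12.5' into a dict of floats."""
    annotations = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            try:
                annotations[key] = float(value)
            except ValueError:
                pass  # this sketch ignores non-numeric annotations
    return annotations


def training_annotations(callset_lines, resource_lines, wanted):
    """Collect annotation values from the call set at sites also in the resource."""
    # The resource contributes only site identity; its INFO column is never read.
    # Assumption: sites are matched on (chromosome, position) alone.
    resource_sites = set()
    for line in resource_lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        resource_sites.add((fields[0], fields[1]))

    training = []
    for line in callset_lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if (fields[0], fields[1]) in resource_sites:
            info = parse_info(fields[7])  # annotations come from the call set
            training.append({name: info.get(name) for name in wanted})
    return training


# Resource records (e.g. from a bundle file) with an empty INFO column.
resource = [
    "chr1\t100\trs1\tA\tG\t.\tPASS\t.",
    "chr1\t300\trs3\tC\tT\t.\tPASS\t.",
]
# Records from your own annotated call set.
callset = [
    "chr1\t100\t.\tA\tG\t50\t.\tDP=40;QD=12.5;FS=1.2",
    "chr1\t200\t.\tG\tC\t20\t.\tDP=8;QD=2.0;FS=30.1",
]

result = training_annotations(callset, resource, ["DP", "QD", "FS"])
print(result)
```

Only the call-set variant at chr1:100 overlaps a resource site, so it alone would contribute annotation values to training; the resource records never need annotations of their own.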
