VQSR training set

Hi all,
I would like to try applying VQSR to my WGS data. What I could use is a SNP database built with GBS on different samples than the ones present in my actual callset. I have a question: do we need variant annotations in the training set, or are they unnecessary (I can't figure it out)? If we need them, which annotations are mandatory? Must they be exactly the same as the ones present in my actual callset? What happens if they are not?

Best Answers


  • benjaminpelissie (Madison, WI; Member)

    That is what I thought. THANKS! :)

  • benjaminpelissie (Madison, WI; Member)

    Hi again. Now I have run into another problem. I have a SNP list as a .txt file (2 columns: scaffold, position in the scaffold) and need a VCF file that I can provide to VariantRecalibrator. Is there a tool that can do this? I tried VariantsToVCF and recoding with VCFtools, but neither gave me any result...

  • benjaminpelissie (Madison, WI; Member)

    Thank you. I'm hoping someone with this kind of script will read your answer then :)

  • benjaminpelissie (Madison, WI; Member)

    One thought: can't I just use SelectVariants to output only one sample (from my cohort VCF produced by joint genotyping) while restricting it to the SNPs from my list (obtained by GBS on different samples)? Will it be a problem if the resulting set (hopefully) contains more information, such as annotations or sample-level data, than is needed for a truth+training set for my data?

  • Sheila (Broad Institute; Member, Broadie ✭✭✭✭✭)


    I have never tried that myself, but you can try! :) If you convert the .txt file into a proper interval list that GATK accepts and run SelectVariants with -L, it should work. I think any sites that are not in your VCF will simply be ignored. Let us know how it goes!
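A minimal sketch of the conversion step suggested above, assuming the .txt file is whitespace-delimited with the scaffold name in the first column and a 1-based position in the second. The function names and file names are hypothetical, and the output uses the scaffold:position form mentioned later in this thread; this is an illustration, not a GATK utility.

```python
def to_interval_lines(rows):
    """Turn 'scaffold position' rows into 'scaffold:position' interval strings.

    Rows with fewer than two columns, or with a non-numeric second column
    (for example a header line), are skipped.
    """
    intervals = []
    for row in rows:
        fields = row.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        intervals.append(f"{fields[0]}:{int(fields[1])}")
    return intervals


def convert_file(table_path, intervals_path):
    """Read the two-column table and write one interval per line.

    Returns the number of intervals written.
    """
    with open(table_path) as src:
        intervals = to_interval_lines(src)
    with open(intervals_path, "w") as dst:
        dst.writelines(line + "\n" for line in intervals)
    return len(intervals)
```

For example, `convert_file("snps.txt", "snps.intervals")` would write a file that can be passed to -L. Both file names are placeholders, and the `.intervals` extension is an assumption about what your GATK version expects for this format.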


  • benjaminpelissie (Madison, WI; Member)

    Thanks! I am trying it right now and will let you know about it, sure.

  • benjaminpelissie (Madison, WI; Member)

    Apparently it went well! :) I used -L with a text file containing SNP IDs (scaffold:position) and -sites_only to output only site-level information, discarding all sample-level information. It ran pretty quickly, even though I was processing a cohort VCF of 88 whole genomes and a text file containing >40k SNPs.

    One very weird thing I noticed: when I used -nt (together with -Xmx10G) with SelectVariants, the estimated run time was between ~50 h and 5 days, whereas it took less than 5 minutes once I removed the -nt argument. Anyway.

    All in all it is ok for me! Thanks again!
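For readers who want to reproduce the run described above, here is a rough sketch that assembles the command as an argument list. Only -L, -sites_only and -Xmx10G come from the post; the jar name, the -T/-R/-V/-o arguments (assumed GATK 3-style syntax), the function name and all file names are assumptions for illustration. -nt is left out, since it slowed the run down here.

```python
import shlex


def select_variants_command(reference, cohort_vcf, intervals, output_vcf,
                            jar="GenomeAnalysisTK.jar", max_heap="10G"):
    """Build a GATK 3-style SelectVariants command as a list of arguments.

    The argument layout is an assumption based on GATK 3 conventions;
    check it against the documentation for the version you are running.
    """
    return [
        "java", f"-Xmx{max_heap}", "-jar", jar,
        "-T", "SelectVariants",
        "-R", reference,      # reference FASTA (placeholder path)
        "-V", cohort_vcf,     # joint-genotyped cohort VCF
        "-L", intervals,      # scaffold:position list of training SNPs
        "-sites_only",        # drop all sample-level columns
        "-o", output_vcf,
    ]


if __name__ == "__main__":
    command = select_variants_command(
        "reference.fasta", "cohort.vcf", "snps.intervals", "training_sites.vcf"
    )
    # Print the command for inspection; run it with subprocess.run(command).
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems if a path contains spaces.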

  • bmansfeld (East Lansing, MI; Member)

    Hey @benjaminpelissie and @Sheila,
    I would like to do something similar in my system (cucumber) and have some questions.
    If I understand correctly, you are using SelectVariants to extract variants that exist in your samples at the positions described in your text file?
    What happens when you don't have a variant at those positions in your samples? My text file was produced from resequencing of 115 different lines and will include several-fold more SNPs than I have in my lines. I guess it won't matter for calibration, since we don't have variants there... How can I make this file useful for other samples that might have a SNP where my current ones don't?
    Since the actual genotype at those positions doesn't matter, we simply need to know where the variants are.
    Is this true for BQSR training set too?
    As I write this it's making a little more sense to me.
    Let me know your thoughts.
    Thanks in advance,

  • benjaminpelissie (Madison, WI; Member)

    Hello @bmansfeld,
    You are right about my use of SelectVariants. Although I never checked it formally, I am almost sure that SNPs present in your (soon-to-be) training set but absent from your focal cohort will simply be ignored, i.e. it will select only the SNPs that overlap between your files. If you add new samples in the future, you will need to repeat the procedure, so the overlap between files might differ (depending on the variant calling for the new samples), and you could end up with training SNPs that "appear" or "disappear" when you compare your different cohorts. If you want a homogeneous SNP representation across your cohorts, you can still rerun SelectVariants and require that the SNPs be present in 100% of your samples. As for BQSR, I know too little about it to be of any help.
    Good luck!
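To make the overlap behaviour described above concrete, here is a toy sketch with made-up scaffold names and positions. It only mimics the expected outcome (sites kept when they appear in both the training list and the callset) and does not reproduce what SelectVariants does internally.

```python
def overlapping_sites(training_sites, callset_sites):
    """Return the (scaffold, position) pairs found in both collections, sorted."""
    return sorted(set(training_sites) & set(callset_sites))


# Hypothetical data: one training list, two differently called cohorts.
training = [("scaffold_1", 1042), ("scaffold_1", 20877), ("scaffold_7", 315)]
cohort_a = [("scaffold_1", 1042), ("scaffold_7", 315), ("scaffold_9", 88)]
cohort_b = [("scaffold_1", 20877), ("scaffold_7", 315)]

print(overlapping_sites(training, cohort_a))
print(overlapping_sites(training, cohort_b))
```

The two cohorts keep different subsets of the same training list, which is why training SNPs can seem to "appear" or "disappear" between cohorts.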
