Minimising Batch Effects in VariantRecalibration

Hi there,

I've checked the best practices and documentation for the correct way to run VQSR so as to avoid the dreaded "No data found" error. The advice that keeps coming up is not to run single-sample VCFs, but rather to run all samples together as one multisample VCF.

What, then, is the correct way to account for batch effects when using this kind of input with the VariantRecalibrator tool? Is this something I should be worried about?
Say I use a multisample vcf as input where N=100 and then run another instance where the input sample has N=80, are there inherent dangers in batch effects between the 2 runs?
Would you recommend running each sample in a batch with the 1000G as the multisample vcf i.e. N=1001?

Any other thoughts or ideas?


  • foxyjohn (Member)

    Also, what if batch 1 (N=100) is all one population subtype (e.g. Asian) and batch 2 (N=80) is 90% European? Won't there be knock-on effects from using these different multisample VCFs in the recalibration?

    Thanks for any input.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    Hi Sean,

    In general, recalibrating cohorts in separate batches does introduce batch effects, yes. This is not so much influenced by population ethnicity though; it's more the effect that each sample contributes some information to the recalibration process, so results become non-independent. That is why we recommend that any samples that you want to analyze together (e.g. to compare to each other) should be processed together, from joint genotyping through filtering.

    In your situation, if you plan to analyse the two batches together at any point in your downstream work, I would recommend processing them as a single batch of N=180. Enabling that is one of the key gains of the GVCF workflow.

    If the two cohorts are completely separate and you have no plans to compare them to each other, you can safely process them separately, with the caveat that if you ever change your mind you would have to reprocess them from GVCF.
  • foxyjohn (Member)

    Thanks Geraldine,

    Each sample will be compared individually to the patient cohort plus previous and future cohorts. We had been analysing them individually in VQSR up until now, and getting away with it: we suspect HaplotypeCaller v3.3 wasn't as good as v3.6 at making good calls, so the number of "bad" variants in one sample was enough to pass the VQSR training set thresholds. v3.6 is not allowing us to get away with this! We're trying to figure out the best strategy to work around this now, so any help would be invaluable!

    I have one follow-up question: assuming the population subtypes in my cohort (N=180) are not equally distributed, what is the best way to correct for this when running VQSR on a multisample VCF? Would padding with 1000G samples be a good option in general, again assuming the best practices applied to our samples and to 1000G were roughly similar?

  • foxyjohn (Member)

    Thanks Geraldine, appreciate your feedback. Food for thought.
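
For readers following the advice above to process both batches as a single cohort of N=180, here is a minimal Python sketch of how the per-sample GVCFs from the two batches might be pooled into one joint genotyping call. It is an illustration under stated assumptions, not a command taken from this thread: the file paths, the jar name and the helper function are hypothetical, and the flags follow the GATK 3.x style of `GenotypeGVCFs` invocation (`-T`, `-R`, `--variant`, `-o`), so they should be checked against the documentation for the version actually installed. The resulting multisample VCF would then go through VQSR as one unit.

```python
from typing import Iterable, List


def build_joint_genotyping_command(
    batches: Iterable[Iterable[str]],
    reference: str,
    output_vcf: str,
    gatk_jar: str = "GenomeAnalysisTK.jar",  # hypothetical jar location
) -> List[str]:
    """Build one GATK 3.x-style GenotypeGVCFs command covering all batches.

    Every per-sample GVCF from every batch goes into the same call, so
    that genotyping (and later filtering) sees the whole cohort at once.
    """
    gvcfs: List[str] = []
    seen = set()
    for batch in batches:
        for path in batch:
            # Refuse to list the same GVCF twice.
            if path in seen:
                raise ValueError(f"duplicate GVCF: {path}")
            seen.add(path)
            gvcfs.append(path)
    if not gvcfs:
        raise ValueError("no GVCFs supplied")

    cmd = ["java", "-jar", gatk_jar, "-T", "GenotypeGVCFs", "-R", reference]
    for path in gvcfs:
        cmd += ["--variant", path]
    cmd += ["-o", output_vcf]
    return cmd


# Hypothetical file layout mirroring the two batches discussed in the thread.
batch1 = [f"batch1/sample{i:03d}.g.vcf" for i in range(1, 101)]  # N=100
batch2 = [f"batch2/sample{i:03d}.g.vcf" for i in range(1, 81)]   # N=80

cmd = build_joint_genotyping_command(
    [batch1, batch2], "reference.fasta", "cohort_n180.vcf"
)
print(cmd.count("--variant"))  # number of samples in the joint call
```

The function only assembles the argument list; it could be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. Keeping the per-sample GVCFs around is what makes it possible to rebuild the cohort later if further batches need to be compared with it.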
