VQSR inputs

Hi,

Sorry I'm not clear on one point:

Say I want to run VQSR on a set of 30 samples (exomes).
Do I need to run GenotypeGVCFs on all 30 GVCF files and then feed the single joint VCF
output into VariantRecalibrator,
or
do I need to feed the 30 individual VCF files into VariantRecalibrator?

Thanks
Severine

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)
    edited March 2016

    @severinec
    Hi Severine,

    You should run GenotypeGVCFs on your individual GVCFs then run VQSR on the resulting VCF. The GVCFs are intermediate files that are not meant to be used as final analysis files. https://www.broadinstitute.org/gatk/guide/article?id=4017

    -Sheila
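
The two-step order described above (joint genotyping first, then VQSR on the single joint VCF) can be sketched as a pair of command builders. This is only an illustration, assuming GATK 3.x command-line conventions (`-T GenotypeGVCFs`, `-T VariantRecalibrator`); newer GATK releases use a different invocation. The jar path, file names, resource list, and annotation list are all hypothetical placeholders, not recommendations from this thread.

```python
from typing import List, Tuple

GATK_JAR = "GenomeAnalysisTK.jar"  # assumption: GATK 3.x jar in the working directory


def genotype_gvcfs_cmd(reference: str, gvcfs: List[str], out_vcf: str) -> List[str]:
    """Build one GenotypeGVCFs command that takes every per-sample GVCF as input."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]  # one --variant argument per sample GVCF
    cmd += ["-o", out_vcf]  # a single joint VCF comes out
    return cmd


def variant_recalibrator_cmd(
    reference: str,
    joint_vcf: str,
    resources: List[Tuple[str, str]],
    annotations: List[str],
    mode: str = "SNP",
) -> List[str]:
    """Build a VariantRecalibrator command that reads the joint VCF, not the GVCFs."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "VariantRecalibrator",
           "-R", reference, "-input", joint_vcf, "-mode", mode]
    for tags, path in resources:
        cmd += ["-resource:" + tags, path]  # training/truth resource sets
    for name in annotations:
        cmd += ["-an", name]  # annotations used to build the model
    cmd += ["-recalFile", "joint." + mode + ".recal",
            "-tranchesFile", "joint." + mode + ".tranches"]
    return cmd


# Hypothetical inputs: 30 exome GVCFs named sample01.g.vcf ... sample30.g.vcf
gvcfs = ["sample%02d.g.vcf" % i for i in range(1, 31)]
step1 = genotype_gvcfs_cmd("reference.fasta", gvcfs, "joint.vcf")
step2 = variant_recalibrator_cmd(
    "reference.fasta",
    "joint.vcf",
    resources=[("hapmap,known=false,training=true,truth=true,prior=15.0", "hapmap.vcf")],
    annotations=["QD", "FS", "MQRankSum", "ReadPosRankSum"],
)
```

Each list could be passed to `subprocess.run(cmd, check=True)` to execute it. The point of the sketch is the data flow: all 30 GVCFs go into step 1, and only `joint.vcf` goes into step 2.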

  • severinec (Member)

    BTW, when running GenotypeGVCFs on a set of many samples, for the sake of gathering enough samples
    to run VQSR over, is it important that all samples come from the same sequencing instrument, similar library,
    read length, etc., or is it sufficient to just have human samples?
    Thanks!
    Severine

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @severinec
    Hi Severine,

    The most important thing is that the annotations have roughly the same distributions between all the different datasets. Have a look at my answer from February 2015 here.

    -Sheila

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    To add to @Sheila's response, I would clarify that you should only call exomes with other exomes, and genomes with other genomes. No mixing between major experimental design types. Other than that, it is preferable to use samples that were produced with the same technology (library prep/capture kits, sequencing chemistry, read length, etc.) if you have a choice, but this is not an absolute requirement if you're really constrained. When the data share the same technical properties, the distributions of annotation values will be more similar (which is what @Sheila referred to), whereas different technologies will cause more divergence. That divergence in turn leads to less robustness in the more complicated steps of the pipeline (esp. VQSR), potential batch effects, and loss of callable regions (exomes prepared with different capture kits have different target intervals).
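
To make the "loss of callable regions" point concrete, here is a minimal sketch that intersects the target intervals of two capture kits. It assumes BED-style half-open intervals given as `(chrom, start, end)` tuples; the function names and coordinates are made up for illustration and are not part of GATK.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open as in BED


def _merge(spans: List[Tuple[int, int]]) -> List[List[int]]:
    """Sort spans on one chromosome and merge any that overlap or touch."""
    merged: List[List[int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def _by_chrom(intervals: List[Interval]) -> Dict[str, List[Tuple[int, int]]]:
    """Group intervals by chromosome name."""
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in intervals:
        grouped[chrom].append((start, end))
    return grouped


def shared_targets(kit_a: List[Interval], kit_b: List[Interval]) -> List[Interval]:
    """Return the regions targeted by both kits."""
    a, b = _by_chrom(kit_a), _by_chrom(kit_b)
    shared: List[Interval] = []
    for chrom in sorted(set(a) & set(b)):
        xs, ys = _merge(a[chrom]), _merge(b[chrom])
        i = j = 0
        while i < len(xs) and j < len(ys):
            lo = max(xs[i][0], ys[j][0])
            hi = min(xs[i][1], ys[j][1])
            if lo < hi:  # the two spans overlap
                shared.append((chrom, lo, hi))
            # advance whichever span ends first
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
    return shared


# Made-up target lists for two hypothetical capture kits
kit_a = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]
kit_b = [("chr1", 150, 350), ("chr3", 10, 20)]
common = shared_targets(kit_a, kit_b)
common_bases = sum(end - start for _, start, end in common)
```

In this toy case kit A covers 230 bases, but only 100 of them are also covered by kit B, so a joint analysis restricted to shared targets loses more than half of kit A's regions.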

  • severinec (Member)

    Got it. Thanks very much
