VQSR question.

Hello,

It's mentioned in the VQSR documentation that, to achieve the best results, we need to call variants on at least 30 samples at a time and then apply VQSR to the whole dataset. In our lab we only multiplex 4 exome samples in a run. Does this mean that for every single run I need to add 26 additional exome samples, call variants, and then apply VQSR to the entire dataset? Is it not recommended to run VQSR on a single-exome VCF file? Are there any other options for me (other than hard filtering of variants)?

Thanks,
Teja.

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi Teja,

    That's right. If you do not have a sufficiently large cohort of your own, our recommendation is to get exomes from the 1000 Genomes Project and add those to your cohort. Either you call variants on the entire cohort together, or you use HaplotypeCaller's GVCF mode to call variants per sample and then do joint genotyping on the resulting GVCFs. In either case, you then run VQSR on the final multisample VCF.
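The two-step route described above (per-sample GVCFs, then joint genotyping) can be sketched as a pair of command builders. This is only a rough illustration, assuming GATK 3.x command-line syntax (`GenomeAnalysisTK.jar` with `-T`); the jar path, reference name, and sample file names are hypothetical placeholders, and the exact arguments may differ between GATK versions. The VQSR step itself is left out because its resource and annotation arguments depend on the dataset.

```python
from typing import List

# Hypothetical paths: replace with your own installation and reference.
GATK_JAR = "GenomeAnalysisTK.jar"
REFERENCE = "reference.fasta"


def haplotypecaller_gvcf_cmd(bam: str, gvcf: str, intervals: str) -> List[str]:
    """Build a per-sample HaplotypeCaller command in GVCF mode."""
    return [
        "java", "-jar", GATK_JAR,
        "-T", "HaplotypeCaller",
        "-R", REFERENCE,
        "-I", bam,
        "--emitRefConfidence", "GVCF",
        "-L", intervals,
        "-o", gvcf,
    ]


def genotype_gvcfs_cmd(gvcfs: List[str], out_vcf: str) -> List[str]:
    """Build a joint-genotyping command over any number of GVCFs."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "GenotypeGVCFs", "-R", REFERENCE]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    cmd += ["-o", out_vcf]
    return cmd


if __name__ == "__main__":
    # New samples from the current run plus previously created GVCFs
    # (file names are made up for illustration).
    new_gvcfs = ["run7_s1.g.vcf", "run7_s2.g.vcf"]
    stored_gvcfs = ["1000g_NA00001.g.vcf", "1000g_NA00002.g.vcf"]
    print(" ".join(haplotypecaller_gvcf_cmd("run7_s1.bam", "run7_s1.g.vcf",
                                            "targets.interval_list")))
    print(" ".join(genotype_gvcfs_cmd(new_gvcfs + stored_gvcfs, "cohort.vcf")))
```

The commands are returned as argument lists so they can be passed to `subprocess.run` without shell quoting problems.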

  • nvteja (Member)

    Hello Geraldine,

    Thanks for the reply. So, if I understand correctly, it is okay to create GVCFs only once from the 1000 Genomes Project BAMs and then add those GVCFs to our cohorts for joint genotyping, right? Or should I create GVCFs from the 1000G BAM files every single time I need to call variants in our exome runs? I am assuming that creating the GVCFs once from the 1000G files is enough, but I just want to make sure. Also, if we have 30 exome samples from the past, I can create GVCFs for those 30 samples and use them for future exome runs, right?

    Thanks a lot! This forum has been very helpful.
    Teja.

  • nvteja (Member)

    Also, one more question. Currently we multiplex 4 exomes with 24 targeted sequencing panels in one run. After I call variants in GVCF mode on each sample, can I do the joint genotyping on all these samples at the same time, even though our targeted samples do not cover the entire exome? (In the targeted samples we only target 100-150 genes at a time.) In this case, can I also use the same GVCFs that I would add to the exome samples for joint genotyping with the targeted samples? Please let me know if this doesn't make sense to you.

    Thanks again!
    Teja.

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @nvteja

    Hi Teja,

    You are correct that you can simply create GVCFs from the 1000 Genomes BAMs once and save those for the future.

    If you have 30 exome samples from the past, you can definitely use those as your extra samples instead of the 1000 Genomes BAMs.

    I am not sure I understand your last question. I think you are asking whether you can still use the GVCF files together if the samples do not span the same intervals. We do not test it like that at the Broad, but in principle it should work fine.

    -Sheila

  • nvteja (Member)

    Thanks a lot for the reply Sheila.

    What I meant by my last question was: if I create GVCFs from our 30 past exome samples, can I still use those GVCFs for non-exome samples, i.e., samples in which we target only certain genes and not the whole exome? These genes will certainly overlap the exome targets, but not entirely (when we target specific genes, we also target certain deep intronic regions). The reason I am asking is that I do not want to use hard filtering on these targeted samples and would prefer to recalibrate variant qualities in these samples through VQSR.

    Let me know if that is okay.

    Teja.

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @nvteja

    Hi Teja,

    Thanks for clarifying. Sure, you can use the 30 exome samples with non-exome samples.

    -Sheila

  • nvteja (Member)

    Thanks Sheila. In that case, since the exome samples will not have any information in their GVCF files for the deep intronic regions, how will that affect the VQSR process for variants in those deep intronic regions in our targeted samples?

    Teja.

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @nvteja

    Hi Teja,

    After talking to Geraldine, I found out the most important thing is to make sure the various annotations you will use in VQSR are distributed roughly the same way for both the exome samples and the targeted samples. For example, you can plot DP for the exome samples and DP for the targeted samples and make sure they have approximately the same distribution.

    This is important because VQSR uses the annotations in a Gaussian Mixture Model. If the annotations are different in the samples, that can cause incorrect results.

    https://www.broadinstitute.org/gatk/guide/article?id=39

    -Sheila
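The DP comparison suggested above can be approximated without plotting by summarizing each group's depth values and looking at them side by side. The sketch below is only one way to do that, assuming each group (exome, targeted) has been written to its own VCF and that records carry a `DP` entry in the INFO column; the function names and the example records are made up for illustration. Note that INFO `DP` is a site-level total across the samples in a file, so the two files should contain comparable numbers of samples, or per-sample depths should be used instead.

```python
import statistics
from typing import Dict, Iterable, List


def info_dp_values(vcf_lines: Iterable[str]) -> List[int]:
    """Collect site-level DP values from the INFO column of VCF records."""
    depths = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        for entry in fields[7].split(";"):
            if entry.startswith("DP="):
                depths.append(int(entry[3:]))
                break
    return depths


def summarize(values: List[int]) -> Dict[str, float]:
    """Return the quartiles of a list of depth values (needs >= 2 values)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"q1": q1, "median": median, "q3": q3}


if __name__ == "__main__":
    # Tiny made-up records; in practice, pass an open file handle per group.
    exome = [
        "##fileformat=VCFv4.1",
        "1\t100\t.\tA\tG\t50\t.\tAC=1;DP=40;QD=12.0",
        "1\t200\t.\tC\tT\t60\t.\tDP=55;QD=15.1",
        "1\t300\t.\tG\tA\t70\t.\tDP=62",
    ]
    targeted = [
        "1\t150\t.\tA\tC\t90\t.\tDP=400;QD=20.0",
        "1\t250\t.\tT\tG\t95\t.\tDP=520",
        "1\t350\t.\tC\tA\t80\t.\tDP=610",
    ]
    print("exome   ", summarize(info_dp_values(exome)))
    print("targeted", summarize(info_dp_values(targeted)))
```

If the quartiles of the two groups are far apart, as in the made-up data here, that is a hint the annotation is not distributed the same way in both groups. The same approach can be repeated for the other annotations used in VQSR.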

  • nvteja (Member)

    Hello Sheila,

    Good Morning.

    I have noticed that HaplotypeCaller takes a long time to create GVCFs from whole-exome BAM files (even with the -L argument). Is it okay to split the GVCF processing by chromosome, using the -L argument with per-chromosome intervals, thus creating chromosome-level GVCFs? If so, is it recommended at the GenotypeGVCFs step to combine all the chromosome GVCFs into one single file per sample in the cohort and then perform joint genotyping and VQSR on the whole dataset? Or is it still okay to keep the processing separated per chromosome for joint genotyping and VQSR?

    Thanks,
    Teja.
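For readers who want to see what the per-chromosome split in the question above could look like, here is a minimal sketch that groups exome target intervals by chromosome so each group can be passed to a separate run via `-L`. It assumes the targets are in BED format; the function names are hypothetical, and the sketch says nothing about whether per-chromosome processing is advisable for joint genotyping or VQSR, which is what the question asks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]


def group_bed_by_chromosome(bed_lines: Iterable[str]) -> Dict[str, List[Interval]]:
    """Group BED intervals (0-based, half-open) by chromosome name."""
    per_chrom = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and BED header lines
        chrom, start, end = line.split()[:3]
        per_chrom[chrom].append((int(start), int(end)))
    return dict(per_chrom)


def to_gatk_intervals(chrom: str, intervals: List[Interval]) -> List[str]:
    """Convert BED coordinates to 1-based inclusive chr:start-end strings."""
    return [f"{chrom}:{start + 1}-{end}" for start, end in intervals]


if __name__ == "__main__":
    # Made-up targets for illustration.
    bed = [
        "chr1\t99\t200\ttargetA",
        "chr1\t499\t800\ttargetB",
        "chr2\t0\t150\ttargetC",
    ]
    for chrom, intervals in group_bed_by_chromosome(bed).items():
        print(chrom, to_gatk_intervals(chrom, intervals))
```

Each chromosome's list could be written to its own intervals file and supplied to one HaplotypeCaller job, which lets the jobs run in parallel.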

  • nvteja (Member)

    Thanks a lot Geraldine. That was exactly what I had in mind.

    Teja.
