WGS+WES combined discovery/genotyping

Hi GATK team,

Hope you had great holidays!

We're analyzing small families where some individuals have been sequenced by WES (HiSeqX) and others by WES (HiSeq4000). Could you please advise on the best approach to variant discovery and genotyping for these sets? We would prefer to avoid the difficult normalization of different VCF representations of identical variants, which is needed when the WES|WGS sets are analyzed separately.

Our best idea so far is to run HC over mostly overlapping intervals (e.g. GenCode exons) on all individual samples in both sets, then jointly genotype the mixed g.vcfs (GenotypeGVCFs), accepting that there will be some ./. calls in each set.
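The workflow described above (per-sample HaplotypeCaller in GVCF mode over a shared interval list, then one joint GenotypeGVCFs run) could be sketched roughly as follows. This only assembles GATK3-style command lines and does not run them. The jar location, file names and sample names are hypothetical placeholders, not taken from the thread.

```python
from typing import List

GATK_JAR = "GenomeAnalysisTK.jar"  # hypothetical path to a GATK3 jar


def haplotype_caller_cmd(reference: str, bam: str, intervals: str, out_gvcf: str) -> List[str]:
    """Per-sample discovery in GVCF mode, restricted to a shared interval list."""
    return [
        "java", "-jar", GATK_JAR, "-T", "HaplotypeCaller",
        "-R", reference, "-I", bam, "-L", intervals,
        "--emitRefConfidence", "GVCF", "-o", out_gvcf,
    ]


def genotype_gvcfs_cmd(reference: str, gvcfs: List[str], intervals: str, out_vcf: str) -> List[str]:
    """Joint genotyping across all per-sample GVCFs, whatever their platform."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "GenotypeGVCFs", "-R", reference, "-L", intervals]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    return cmd + ["-o", out_vcf]


# Hypothetical family: every sample gets the same interval list.
samples = ["father", "mother", "child"]
hc_cmds = [
    haplotype_caller_cmd("ref.fasta", f"{s}.bam", "exons.interval_list", f"{s}.g.vcf")
    for s in samples
]
joint_cmd = genotype_gvcfs_cmd(
    "ref.fasta", [f"{s}.g.vcf" for s in samples], "exons.interval_list", "family.vcf"
)
print(" ".join(joint_cmd))
```

Each list can be passed to `subprocess.run` once the paths point at real files.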

Also, could VQSR cope with the mixed variant properties?

We noticed that @Geraldine_VdAuwera has advised against a similar idea earlier this year (http://gatkforums.broadinstitute.org/wdl/discussion/6834/about-gatk-joint-call), but that was more complex (WES+WGS+RNAseq) and of course you may have looked into this since then.

Thanks in advance for your thoughts and advice

Best Answer


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi @KlausNZ! Well, we're all rolling around like small planetoids from too much eating, but it was a good practice run for the winter holidays next month!

    My admonition in the thread you mention was definitely more against joint-calling across major gaps in technology, like the Exome vs WGS divide, which is huge. Generally speaking we're fairly cool with joint-calling across different exome techs as long as you keep in mind that there are potential batch effects that could occur, because we think those are outweighed by the advantages of joint-calling (including not having to deal with the horror of reconciling different representations across datasets). For example, the ExAC dataset was joint-called across multiple exome capture techs (including different capture kits, not just sequencing tech) and turned out well.

    As I said there are some potential batch effects, especially at the level of VQSR because you're mixing sources of technical bias, but between exomes the big culprits tend to be similar enough that tools can build a sensible model. This is in contrast to what happens between e.g. exomes and WGS, where the pattern of coverage and other key sources of technical error manifest differently, some radically so.

    In terms of restricting to intervals, our general recommendation is to restrict everything to the intersection of the exome capture intervals in play, but you can potentially take the union as long as the downstream analysis accounts for uncovered intervals (i.e. treats them as a built-in expectation).
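As a minimal sketch of the intersection-versus-union choice, assuming each capture kit's targets are available as BED-style half-open `(chrom, start, end)` tuples: the kit contents below are made-up toy values, and the function names are illustrative rather than part of any GATK tool.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch on the same contig."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def union(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Regions targeted by at least one kit."""
    return merge(a + b)


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Regions targeted by both kits, found with a two-pointer sweep."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a == chrom_b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                out.append((chrom_a, start, end))
        # Advance whichever interval finishes first in sort order.
        if (chrom_a, end_a) <= (chrom_b, end_b):
            i += 1
        else:
            j += 1
    return out


# Toy target lists for two hypothetical capture kits.
kit_a = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150)]
kit_b = [("chr1", 150, 350), ("chr2", 100, 120)]

shared = intersect(kit_a, kit_b)
either = union(kit_a, kit_b)
print(shared)
print(either)
```

With the union, any region present in `either` but absent from `shared` is one where some samples are expected to have no coverage, so missing genotypes there reflect the kit rather than the sample.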

    I hope that helps!

  • Hi Geraldine,

    Many thanks for the quick reply! Yes, I was definitely inquiring about combining calls from WGS (sorry for the typo!) with calls from WES for different members of a family. Looks like you share my concerns... All would be fine if we could call and VQSR the WES samples independently from the WGS samples, then combine the variants (inventing WGS|WES-specific INFO tags) while being aware that the FORMAT fields aren't directly comparable.
    The difficult part is the normalization of variant notations in VCF. Theoretically this should also arise when calling the same sample as you describe (if one technology discovers an additional ALT allele that the other didn't), so your unshared tool might be very useful - are you looking for testers??
    GnomAD came to mind but the method isn't published yet - many thanks for suggesting to contact Daniel's team directly!
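To illustrate what "normalization of variant notations" involves, here is a minimal sketch of the usual trim-and-left-align procedure for a single biallelic variant (the kind of operation performed by tools such as `bcftools norm`). It is not the unshared GATK tool mentioned in this thread. The reference string and coordinates are a toy example, and multi-allelic records, which are the harder case raised above, are deliberately out of scope.

```python
from typing import Tuple


def normalize(pos: int, ref: str, alt: str, ref_seq: str) -> Tuple[int, str, str]:
    """Left-align and trim one biallelic variant.

    pos is 1-based; ref_seq is the contig sequence that pos indexes into.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pad both with the preceding reference base.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the sequence start")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy contig with a CA repeat; both records delete one CA unit.
contig = "GGGCACACACAGGG"
from_wes = normalize(8, "CAC", "C", contig)
from_wgs = normalize(3, "GCA", "G", contig)
print(from_wes, from_wgs)
```

Two callers that place the same deletion at different positions in the repeat end up with the same record after normalization, which is what makes call sets from separate runs comparable.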

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Well, the tool I referred to was written for a one-time analysis, and because it's kind of a fringe use case we weren't sure we wanted to put resources into documenting and supporting it. And now that we're moving to GATK4, it would also involve some porting work. So it's one of those things we have a hard time prioritizing. But it does seem like a shame not to share something that is potentially quite useful. One of the options on the table is that when we switch to GATK4, we might just open-source all the private/unsupported/experimental tools from GATK3 that we're not planning to port to 4, for the community to do with as they'd like (including proposing ports to the open-source portion of GATK4).

  • Hi Geraldine, we understand that there are biases here and there. If we have multiple WES and WGS studies that will be used together in a case-control analysis, what is the best way to do joint genotype calling in this scenario? Thanks!

  • Sheila, Broad Institute (Member, Broadie, Moderator)


    This article gives some insight into the methods gnomAD used. I hope it helps.

