WGS+WES combined discovery/genotyping

Hi GATK team,

Hope you had great holidays!

We're analyzing small families where some individuals have been sequenced by WES (HiSeqX) and others by WES (HiSeq4000). Could you please advise on the best approach to variant discovery and genotyping for these sets? We would prefer to avoid the difficult normalization of different VCF representations of identical variants that results when the WES|WGS sets are analyzed separately.

Our best idea so far is to run HC over mostly overlapping intervals (e.g. GENCODE exons) on all individual samples in both sets, then jointly genotype the mixed g.vcfs (GenotypeGVCFs), accepting that there will be some ./. calls in each set.

Also, could VQSR cope with the mixed variant properties?

We noticed that @Geraldine_VdAuwera has advised against a similar idea earlier this year (http://gatkforums.broadinstitute.org/wdl/discussion/6834/about-gatk-joint-call), but that was more complex (WES+WGS+RNAseq) and of course you may have looked into this since then.

Thanks in advance for your thoughts and advice
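
The workflow proposed above (per-sample HaplotypeCaller in GVCF mode over a shared interval list, then joint genotyping) could be sketched roughly as follows. This is only an illustration: it assumes GATK4-style command lines, whereas the thread predates GATK4, and the file names and the `build_joint_calling_commands` helper are hypothetical. The function only builds the command lines and runs nothing.

```python
from typing import Dict, List


def build_joint_calling_commands(
    reference: str,
    bams: Dict[str, str],
    intervals: str,
    out_prefix: str,
) -> List[List[str]]:
    """Build (but do not run) commands for GVCF calling and joint genotyping.

    Assumes GATK4-style syntax; all paths are placeholders.
    """
    commands: List[List[str]] = []
    gvcfs: List[str] = []

    # One HaplotypeCaller run per sample, restricted to the shared intervals.
    for sample, bam in sorted(bams.items()):
        gvcf = f"{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        commands.append([
            "gatk", "HaplotypeCaller",
            "-R", reference, "-I", bam, "-L", intervals,
            "-ERC", "GVCF", "-O", gvcf,
        ])

    # Merge the per-sample GVCFs into one multi-sample GVCF.
    combined = f"{out_prefix}.combined.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    combine += ["-O", combined]
    commands.append(combine)

    # Joint genotyping across all samples.
    commands.append([
        "gatk", "GenotypeGVCFs", "-R", reference,
        "-V", combined, "-O", f"{out_prefix}.vcf.gz",
    ])
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.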

Best Answer


  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Hi @KlausNZ! Well, we're all rolling around like small planetoids from too much eating, but it was a good practice run for the winter holidays next month!

    My admonition in the thread you mention was definitely more against joint-calling across major gaps in technology, like the Exome vs WGS divide, which is huge. Generally speaking we're fairly cool with joint-calling across different exome techs as long as you keep in mind that there are potential batch effects that could occur, because we think those are outweighed by the advantages of joint-calling (including not having to deal with the horror of reconciling different representations across datasets). For example, the ExAC dataset was joint-called across multiple exome capture techs (including different capture kits, not just sequencing tech) and turned out well.

    As I said there are some potential batch effects, especially at the level of VQSR because you're mixing sources of technical bias, but between exomes the big culprits tend to be similar enough that tools can build a sensible model. This is in contrast to what happens between e.g. exomes and WGS, where the pattern of coverage and other key sources of technical error manifest differently, some radically so.

    In terms of restricting to intervals, our general recommendation is to restrict everything to the intersection of the exome capture intervals in play, but you can potentially take the union as long as you account for uncovered intervals (i.e. treat missing coverage in some samples as a built-in expectation) in the downstream analysis.

    I hope that helps!
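
To make the interval recommendation in the reply above concrete, here is a minimal sketch of computing the intersection and the union of two capture-kit target lists. It assumes BED-style half-open `(chrom, start, end)` tuples; the helper names and the example coordinates are hypothetical and not taken from any real capture kit.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Regions covered by both interval lists."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        (ca, sa, ea), (cb, sb, eb) = a[i], b[j]
        if ca != cb:
            # Advance whichever list is behind in chromosome sort order.
            if ca < cb:
                i += 1
            else:
                j += 1
            continue
        start, end = max(sa, sb), min(ea, eb)
        if start < end:
            out.append((ca, start, end))
        # Advance the interval that ends first.
        if ea <= eb:
            i += 1
        else:
            j += 1
    return out


def union(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Regions covered by at least one interval list."""
    return merge(a + b)


kit_a = [("chr1", 100, 200), ("chr1", 300, 400)]
kit_b = [("chr1", 150, 350), ("chr2", 10, 20)]
shared = intersect(kit_a, kit_b)
either = union(kit_a, kit_b)
```

With the union, regions such as `chr2:10-20` in this toy example are covered by only one kit, so samples from the other kit would be expected to show no-calls there.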

  • KlausNZ Member ✭✭

    Hi Geraldine,

    Many thanks for the quick reply! Yes, I was definitely inquiring about combining calls from WGS (sorry for the typo!) with calls from WES for different members of a family. Looks like you share my concerns... All would be fine if we could call and VQSR the WES samples independently from the WGS samples, then combine the variants (inventing WGS|WES-specific INFO tags) while being aware that the FORMAT fields aren't directly comparable.
    The difficult part is the normalization of variant notations in VCF. Theoretically this should also arise when calling the same sample as you describe (if one technology discovers an additional ALT allele that the other didn't), so your unshared tool might be very useful. Are you looking for testers?
    GnomAD came to mind, but the method isn't published yet. Many thanks for suggesting that I contact Daniel's team directly!
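
As an illustration of the representation problem described above, here is a minimal sketch of one common normalization step: splitting a multi-allelic record into one entry per ALT allele and trimming bases shared by REF and ALT. It is not the unshared GATK tool mentioned in this thread, and it does not left-align indels against the reference (tools such as `bcftools norm` do that as well). The function names and the example records are hypothetical.

```python
from typing import Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def trim_alleles(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Strip bases shared by REF and ALT, keeping at least one base in each."""
    # Trim the shared suffix first; the position is unchanged.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix, shifting the position to the right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_and_trim(chrom: str, pos: int, ref: str, alts: Iterable[str]) -> Set[Variant]:
    """One trimmed (chrom, pos, ref, alt) entry per ALT allele of a record."""
    return {(chrom, *trim_alleles(pos, ref, alt)) for alt in alts}


# Hypothetical example: the WES callset sees only a SNP, while the WGS
# callset also sees a deletion at the same site, which pads the SNP's alleles.
wes = split_and_trim("chr1", 1000, "C", ["T"])
wgs = split_and_trim("chr1", 1000, "CAG", ["TAG", "C"])
shared = wes & wgs
```

After trimming, the SNP has the same key in both callsets, so the two records can be matched even though their original REF and ALT strings differ.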

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    Well, the tool I referred to was written for a one-time analysis, and because it's kind of a fringe use case we weren't sure we wanted to put resources into documenting and supporting it. And now that we're moving to GATK4, it would also involve some porting work. So it's one of those things we have a hard time prioritizing. But it does seem like a shame not to share something that is potentially quite useful. One of the options on the table is that when we switch to GATK4, we might just open-source all the private/unsupported/experimental tools from GATK3 that we're not planning to port to 4, for the community to do with as they'd like (including proposing ports to the open-source portion of GATK4).

  • FannyL Member

    Hi Geraldine, we understand that there are biases here and there. If we have multiple WES and WGS studies that will be used together in a case-control analysis, what's the best way to do joint genotype calling in this scenario? Thanks!

  • Sheila (Broad Institute) Member, Broadie admin


    This article gives some insight into the methods gnomAD used. I hope it helps.

