
Should I analyze my samples alone or together?

Geraldine_VdAuwera Administrator, Dev Posts: 11,118 admin
edited May 2015 in FAQs

Together is (almost always) better than alone

We recommend performing variant discovery in a way that enables joint analysis of multiple samples, as laid out in our Best Practices workflow. That workflow includes a joint analysis step that empowers variant discovery by leveraging population-wide information from a cohort of multiple samples, allowing us to detect variants with great sensitivity and genotype samples as accurately as possible. Our workflow recommendations provide a way to do this that is scalable and allows incremental processing of the sequencing data.

The key point is that you don’t actually have to call variants on all your samples together to perform a joint analysis. We have developed a workflow that allows us to decouple the initial identification of potential variant sites (i.e. variant calling) from the genotyping step, which is the only part that really needs to be done jointly. Since GATK 3.0, you can use the HaplotypeCaller to call variants individually per sample in -ERC GVCF mode, followed by a joint genotyping step on all samples in the cohort, as described in this method article. This achieves what we call incremental joint discovery, providing you with all the benefits of classic joint calling (as described below) without the drawbacks.
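To make the two steps concrete, here is a minimal sketch of what the GATK 3.x command lines might look like (file names are placeholders, and you will typically add your own intervals and other options appropriate to your data):

    # Step 1: call variants per sample in GVCF mode (repeat for each sample)
    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R reference.fasta \
        -I sample1.bam \
        -ERC GVCF \
        -o sample1.g.vcf

    # Step 2: joint genotyping across the whole cohort of GVCFs
    java -jar GenomeAnalysisTK.jar \
        -T GenotypeGVCFs \
        -R reference.fasta \
        -V sample1.g.vcf \
        -V sample2.g.vcf \
        -V sample3.g.vcf \
        -o cohort.vcf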

Why "almost always"? Because some people have reported missing a small fraction of singletons (variants that are unique to individual samples) when using the new method. For most studies, this is an acceptable tradeoff (which is reduced by the availability of high quality sequencing data), but if you are very specifically looking for singletons, you may need to do some careful evaluation before committing to this method.


Previously established cohort analysis strategies

Until recently, three strategies were available for variant discovery in multiple samples:

- single sample calling: sample BAMs are analyzed individually, and individual call sets are combined in a downstream processing step;
- batch calling: sample BAMs are analyzed in separate batches, and batch call sets are merged in a downstream processing step;
- joint calling: variants are called simultaneously across all sample BAMs, generating a single call set for the entire cohort.

The best of these, from the point of view of variant discovery, was joint calling, because it provided the following benefits:

1. Clearer distinction between homozygous reference sites and sites with missing data

Batch-calling does not output a genotype call at sites where no member in the batch has evidence for a variant; it is thus impossible to distinguish such sites from locations missing data. In contrast, joint calling emits genotype calls at every site where any individual in the call set has evidence for variation.

2. Greater sensitivity for low-frequency variants

By sharing information across all samples, joint calling makes it possible to “rescue” genotype calls at sites where a carrier has low coverage but other samples within the call set have a confident variant at that location. However, this does not apply to singletons, which by definition are unique to a single sample. To minimize the chance of missing singletons, we increase the cohort size, so that variants are less likely to be singletons in the first place.

3. Greater ability to filter out false positives

The current approaches to variant filtering (such as VQSR) use statistical models that work better with large amounts of data. Of the three calling strategies above, only joint calling provides enough data for accurate error modeling and ensures that filtering is applied uniformly across all samples.


Figure 1: (left) Power of joint calling in finding mutations at low coverage sites. The variant allele is present in only two of the N samples, in both cases with such low coverage that the variant is not callable when processed separately. Joint calling allows evidence to be accumulated over all samples and renders the variant callable. (right) Importance of joint calling to square off the genotype matrix, using an example of two disease-relevant variants. Neither sample will have records in a variants-only output file, for different reasons: the first sample is homozygous reference while the second sample has no data. However, merging the results from single sample calling will incorrectly treat both of these samples identically as being non-informative.


Drawbacks of traditional joint calling (all steps performed multi-sample)

There are two major problems with the joint calling strategy.

- Scaling & infrastructure
Joint calling scales very badly -- the calculations involved in variant calling (especially with methods like the HaplotypeCaller's) become exponentially more computationally costly as you add samples to the cohort. If you don't have a lot of compute available, you run into limitations pretty quickly. Even here at Broad, where we have fairly ridiculous amounts of compute available, we can't brute-force our way through the numbers for the larger cohort sizes that we're called on to handle.

- The N+1 problem
When you’re getting a large-ish number of samples sequenced (especially clinical samples), you typically get them in small batches over an extended period of time, and you analyze each batch as it comes in (whether it’s because the analysis is time-sensitive or your PI is breathing down your neck). But that’s not joint calling, that’s batch calling, and it doesn’t give you the same significant gains that joint calling can give you. Unfortunately, the joint calling approach doesn’t allow for incremental analysis -- every time you get even one new sample sequence, you have to re-call all samples from scratch.

Both of these problems are solved by the single-sample calling + joint genotyping workflow.

Post edited by Geraldine_VdAuwera

Geraldine Van der Auwera, PhD

Comments

  • Jiahao Member Posts: 3

    In addition to variant detection, I am wondering whether it is still recommended to do realignment and recalibration together for a multi-sample study, e.g. normal vs. tumor?

  • Sheila Broad Institute Member, Broadie, Moderator, Dev Posts: 4,443 admin
    edited September 2015

    @Jiahao
    Hi,

    Samples from the same individual should be realigned together. You do not have to worry about BQSR because it takes read groups into account.
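
    As a rough sketch of what realigning two BAMs from the same individual together could look like in GATK 3.x (file names here are hypothetical, and known-indels resources can be added with -known if you have them):

        # identify intervals to realign over, using both BAMs together
        java -jar GenomeAnalysisTK.jar \
            -T RealignerTargetCreator \
            -R reference.fasta \
            -I normal.bam -I tumor.bam \
            -o realigner.intervals

        # realign both BAMs jointly; --nWayOut writes one output per input BAM
        java -jar GenomeAnalysisTK.jar \
            -T IndelRealigner \
            -R reference.fasta \
            -I normal.bam -I tumor.bam \
            -targetIntervals realigner.intervals \
            --nWayOut .realigned.bam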

    -Sheila

  • MattB Newcastle Member Posts: 39

    I'm curious about the small fraction of singletons that people reported missing: is this still the case with the gVCF workflow and the current version of GATK? When you say "which is reduced by the availability of high quality sequencing data", if I had, say, exomes with 200-fold depth, would singletons with good coverage be less likely to be missed, i.e. is it only poorly covered singletons that I could expect to lose? Additionally, has the ExAC project shed any further light on this?

    Issue filed on GitHub by Sheila: #266, state: closed, closed by vdauwera
  • Geraldine_VdAuwera Administrator, Dev Posts: 11,118 admin

    Hi @MattB,

    In principle, confidently called singletons are unlikely to get dropped, regardless of cohort size. Problems occur as you expect when coverage is very low and the initial call confidence is correspondingly low. So the more depth you have, the better (up to a point) but it's not just absolute depth that counts, it's how well that depth is distributed across the genome or the exome targets. You can have 200x depth as mean coverage yet still have targets that are almost entirely uncovered due to various technical difficulties.

    The ExAC project has definitely generated a lot of insight into this sort of thing but we don't have any numbers that we can share. I believe there's a paper in preparation from the MacArthur lab that will cover some of this.

    Geraldine Van der Auwera, PhD

  • MattB Newcastle Member Posts: 39

    Thanks that's very helpful to know, I look forward to the ExAC paper!

  • mglclinical USA Member Posts: 78

    @Geraldine_VdAuwera ,

    Thank you for this article and for clearly describing how this Joint Genotyping Analysis works. Yes, the mean depth of coverage could be 200, but some sites may end up with only a few reads.

    Scenario in Sample A: specific singleton mutation with poor coverage (example: 10 reads) vs.
    Scenario in Sample B: specific singleton mutation with good coverage (example: 100 reads).

    In scenario A, the "Joint Genotyping Analysis" may not make the variant call, as the variant is observed in only 1 sample and the coverage is poor; whereas in scenario B it would make a call even though the variant is observed in only one sample, because it has good/high coverage.

    We are a lab with an expected sample volume of no more than 5 at a time, so our approach is to run the variant calling step (-ERC GVCF mode) on individual samples and then run the GenotypeGVCFs step on individual samples separately as well. In the future, when our sample volume increases to 30 or more, we can change the GenotypeGVCFs step to run on all samples together.

    Even though our current sample volume is low and we will be running the GenotypeGVCFs step on individual samples, I still see the benefit of this workflow, as it creates the intermediate .g.vcf file that contains the "Homozygous Reference" information, i.e. the reference block records.

    Please let me know whether my understanding is correct.

    Thanks,
    mglclinical

  • Geraldine_VdAuwera Administrator, Dev Posts: 11,118 admin

    @mglclinical Yes, your understanding is correct.

    Note that when you get to larger cohort sizes (thousands of samples and above), it may be helpful to lower the threshold of the QUAL pre-filtering that is done by default by GenotypeGVCFs in order to capture low-confidence singletons. It seems that contrary to what we previously expected, there is a small penalty applied to singletons by the multisample calling algorithm when the population size is very large. The effect is not noticeable in small cohorts but as cohort size grows it is possible that in cases where the coverage or data quality is low, the QUAL of a singleton may be pushed below the threshold. Lowering -stand_emit_conf and -stand_call_conf should be sufficient to recover those calls.
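
    As an illustration only (assuming GATK 3.x syntax, placeholder file names, and a threshold value chosen purely for the example; flag availability can vary between 3.x releases):

        # joint genotyping with lowered confidence thresholds, to recover
        # low-confidence singletons in very large cohorts
        java -jar GenomeAnalysisTK.jar \
            -T GenotypeGVCFs \
            -R reference.fasta \
            -V cohort.g.vcf \
            -stand_call_conf 10.0 \
            -stand_emit_conf 10.0 \
            -o cohort.vcf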

    Geraldine Van der Auwera, PhD

  • willpitchers Michigan State U Member Posts: 6

    I wonder if I could get some guidance on this issue... If I use the HaplotypeCaller tool, followed by GenotypeGVCFs, am I making assumptions about my (non-model) data based on the expected properties of human sequence?

    I'm trying to build a .vcf file representing 64 (diploid, highly heterozygous) fish, from data with a coverage of ~10x, with the intent of searching for variants associated with a case/control phenotype. We're concerned that our low coverage and the population structure in the data may be so divergent from the properties of the human data that the GATK 'expects' that we cannot trust our results. Are there any known issues with trying to apply the 'best practice' workflow to data from non-model organisms?

  • Geraldine_VdAuwera Administrator, Dev Posts: 11,118 admin
    There are a few technical obstacles that arise if you don't have a prior collection of known variants, as is often the case for non-model organisms. For instance, the recalibration tools require known variants (though this can be bootstrapped for BQSR). For variant filtering, you'll need to use hard filters, as the machine learning tools won't work on your data.
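
    As a rough illustration of hard filtering in GATK 3.x (hypothetical file names; the thresholds shown are a subset of the generic starting values suggested in the hard-filtering documentation and should be tuned for your organism and data):

        # flag SNPs that fail basic quality annotations instead of using VQSR
        java -jar GenomeAnalysisTK.jar \
            -T VariantFiltration \
            -R reference.fasta \
            -V cohort_snps.vcf \
            --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
            --filterName "basic_snp_filter" \
            -o cohort_snps.filtered.vcf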

    Besides that there's not really anything that is assumed that would be human specific. The GVCF workflow does assume that variants are more likely to be real if they're observed in more than one sample -- but given your experimental design I don't think that would be a problem. You're looking for variants that are shared within sub populations, right?

    The low coverage is probably the biggest weakness; you may want to do some tests comparing sensitivity between traditional multi-sample calling (giving all samples simultaneously to the HaplotypeCaller) versus applying the two-step GVCF workflow.

    Geraldine Van der Auwera, PhD

  • sirian US Member Posts: 19

    I wonder if it makes sense to do joint calling over samples that are from different gene panels (not entirely different panels, but ones containing probes at different locations of the same genes, plus a few different genes).
    Thanks.

    Issue filed on GitHub by Sheila: #1430, state: closed, closed by vdauwera
  • Geraldine_VdAuwera Administrator, Dev Posts: 11,118 admin

    Hi @sirian, in general this is not an obstacle to joint calling as such, although it introduces some limitations. In such a case we prefer to restrict our analysis to the intersection of all interval lists, but you can also choose to run your analysis on the union of all interval lists if you want to maximize the territory that you can examine. The main thing to keep in mind is that your analysis will be most powerful in regions that are covered in all samples/panel designs, and weakest in those that are unique to one design, especially if any sample is the only one produced by a particular design. In the joint callset this will be reflected by no-calls ("./." genotypes) at sites/samples where data is unavailable due to lack of coverage.
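
    To sketch what restricting joint genotyping to the intersection of two panel designs might look like (GATK 3.x syntax with hypothetical interval and GVCF file names; use UNION instead of INTERSECTION to maximize territory):

        java -jar GenomeAnalysisTK.jar \
            -T GenotypeGVCFs \
            -R reference.fasta \
            -V panelA_sample.g.vcf -V panelB_sample.g.vcf \
            -L panelA.intervals -L panelB.intervals \
            --interval_set_rule INTERSECTION \
            -o joint.vcf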

    Geraldine Van der Auwera, PhD

  • J_R NC Member Posts: 5

    I'm sure this can be found somewhere and I just haven't tracked it down, but I'd like to know more about the mathematical underpinnings of how the joint genotyping workflow works and how it differs from HaplotypeCaller in ordinary VCF mode. Is there a more detailed description anywhere than these articles (https://software.broadinstitute.org/gatk/guide/article?id=3893, http://gatkforums.broadinstitute.org/gatk/discussion/4150/)?
    Thanks!
    Joanna

  • Sheila Broad Institute Member, Broadie, Moderator, Dev Posts: 4,443 admin

    @J_R
    Hi Joanna,

    We do have some basic documents in preparation, but they are not ready for publishing. The best thing to do right now is look at the code.

    -Sheila

  • J_R NC Member Posts: 5
    edited January 10

    Hi Sheila,

    Great! Thanks so much.

    ~Joanna

  • ltw Poland Member Posts: 2

    Hello,

    maybe it's a silly question, but I need a sanity check on this - the title of the first section says "Together is (almost always) better than alone", which implies that the section concerns the comparison between:

    a) variant discovery performed on single samples

    and

    variant discovery performed on multiple samples with one of the following methods:

    b) joint variant calling (BAMs --> VCF)
    c) variant calling on single samples followed by joint genotyping (multiple BAMs --> multiple gVCFs, multiple gVCFs --> VCF)

    But the subsection "Why 'almost always'?" follows the subsection about option c), and it refers to option c) as the "new method" (from what I understood, the "old method" is b)):

    (..) some people have reported missing a small fraction of singletons (variants that are unique to individual samples) when using the new method.

    It made me think: does the problem of missing a small fraction of singletons appear in b) and c) but not in a), or does it appear in c) but not in b)?

    The problem of missing a small fraction of singletons seems important to me, as what I am after is rare variants.

  • shlee Cambridge Member, Administrator, Broadie, Moderator, Dev Posts: 422 admin

    Hi @ltw,

    The experts are away and when they return they may add to my comment. It seems to me that if rare variants are important to you, then you should test for your data whether the workflows we describe above confidently call these rare variants or not.

  • ltw Poland Member Posts: 2

    Thank you for your response, @shlee. I'm looking forward to hearing from the experts; meanwhile I'll run some tests.
