
Should I analyze my samples alone or together?

Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

Together is (almost always) better than alone

We recommend performing variant discovery in a way that enables joint analysis of multiple samples, as laid out in our Best Practices workflow. That workflow includes a joint analysis step that empowers variant discovery by leveraging population-wide information from a cohort of multiple samples, allowing us to detect variants with high sensitivity and to genotype each sample as accurately as possible. Our workflow recommendations make this scalable and allow incremental processing of the sequencing data.

The key point is that you don't actually have to call variants on all your samples together to perform a joint analysis. We have developed a workflow that decouples the initial identification of potential variant sites (i.e. variant calling) from the genotyping step, which is the only part that really needs to be done jointly. Since GATK 3.0, you can use the HaplotypeCaller to call variants individually per sample in -ERC GVCF mode, followed by a joint genotyping step on all samples in the cohort, as described in this method article. This achieves what we call incremental joint discovery, providing you with all the benefits of classic joint calling (as described below) without the drawbacks.
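
As a minimal sketch of the two-step workflow (GATK 3.x syntax; reference and file names are placeholders):

    # Step 1: call each sample individually in GVCF mode (repeat per sample)
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fasta \
        -I sample1.bam \
        --emitRefConfidence GVCF \
        -o sample1.g.vcf

    # Step 2: joint genotyping across all the per-sample GVCFs
    java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
        -R reference.fasta \
        --variant sample1.g.vcf \
        --variant sample2.g.vcf \
        -o cohort.vcf

When a new sample arrives, only Step 1 needs to be run for it; Step 2 is then re-run over the accumulated GVCFs, which is far cheaper than re-calling every BAM from scratch.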

Why "almost always"? Because some people have reported missing a small fraction of singletons (variants that are unique to individual samples) when using the new method. For most studies, this is an acceptable tradeoff (which is reduced by the availability of high quality sequencing data), but if you are very specifically looking for singletons, you may need to do some careful evaluation before committing to this method.


Previously established cohort analysis strategies

Until recently, three strategies were available for variant discovery in multiple samples:

- single sample calling: sample BAMs are analyzed individually, and individual call sets are combined in a downstream processing step;
- batch calling: sample BAMs are analyzed in separate batches, and batch call sets are merged in a downstream processing step;
- joint calling: variants are called simultaneously across all sample BAMs, generating a single call set for the entire cohort.

The best of these, from the point of view of variant discovery, was joint calling, because it provided the following benefits:

1. Clearer distinction between homozygous reference sites and sites with missing data

Batch calling does not output a genotype call at sites where no member of the batch has evidence for a variant; it is thus impossible to distinguish such sites from locations that are simply missing data. In contrast, joint calling emits a genotype call at every site where any individual in the call set has evidence for variation.
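
As a schematic illustration (hypothetical position and values), a joint VCF can distinguish a confidently homozygous-reference sample from one with no usable data at the same site:

    #CHROM  POS      ID  REF  ALT  QUAL   FILTER  INFO  FORMAT  SampleA  SampleB
    chr1    1234567  .   G    A    437.5  PASS    .     GT:DP   0/0:35   ./.

SampleA is called homozygous reference with 35 reads of support, while SampleB is a no-call ("./.") because it has no data there; a variants-only merge of single-sample call sets would leave both samples equally blank.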

2. Greater sensitivity for low-frequency variants

By sharing information across all samples, joint calling makes it possible to "rescue" genotype calls at sites where a carrier has low coverage but other samples within the call set have a confident variant at that location. However, this does not apply to singletons, which are unique to a single sample. To minimize the chance of missing singletons, we increase the cohort size -- so that fewer variants are singletons in the first place.

3. Greater ability to filter out false positives

The current approaches to variant filtering (such as VQSR) use statistical models that work better with large amounts of data. Of the three calling strategies above, only joint calling provides enough data for accurate error modeling and ensures that filtering is applied uniformly across all samples.


Figure 1: (left) Power of joint calling in finding mutations at low-coverage sites. The variant allele is present in only two of the N samples, in both cases with such low coverage that the variant is not callable when the samples are processed separately. Joint calling allows evidence to be accumulated over all samples and renders the variant callable. (right) Importance of joint calling for squaring off the genotype matrix, using an example of two disease-relevant variants. Neither sample will have records in a variants-only output file, for different reasons: the first sample is homozygous reference while the second sample has no data. However, merging the results from single-sample calling will incorrectly treat both of these samples identically as being non-informative.


Drawbacks of traditional joint calling (all steps performed multi-sample)

There are two major problems with the joint calling strategy.

- Scaling & infrastructure
Joint calling scales very badly -- the calculations involved in variant calling (especially with methods like the HaplotypeCaller's) become exponentially more computationally costly as you add samples to the cohort. If you don't have a lot of compute available, you run into limitations pretty quickly. Even here at the Broad, where we have fairly ridiculous amounts of compute available, we can't brute-force our way through the numbers for the larger cohort sizes that we're called on to handle.

- The N+1 problem
When you're getting a large-ish number of samples sequenced (especially clinical samples), you typically get them in small batches over an extended period of time, and you analyze each batch as it comes in (whether because the analysis is time-sensitive or your PI is breathing down your neck). But that's not joint calling, that's batch calling, and it doesn't give you the same significant gains that joint calling can. Unfortunately, the traditional joint calling approach doesn't allow for incremental analysis -- every time you get even one new sample sequence, you have to re-call all samples from scratch.

Both of these problems are solved by the single-sample calling + joint genotyping workflow.


Comments

  • In addition to variant detection, I am wondering whether it is still recommended to do the realignment and recalibration together for a multi-sample study, e.g. normal vs. tumor.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @Jiahao
    Hi,

    Samples from the same individual should be realigned together. You do not have to worry about BQSR because it takes read groups into account.
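
    A minimal sketch of realigning two BAMs from the same individual together (GATK 3.x syntax; file names are placeholders):

        # Identify intervals to realign, using evidence from both BAMs together
        java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
            -R reference.fasta \
            -I normal.bam -I tumor.bam \
            -o realign_targets.intervals

        # Realign both BAMs jointly; -nWayOut writes one realigned BAM per input
        java -jar GenomeAnalysisTK.jar -T IndelRealigner \
            -R reference.fasta \
            -I normal.bam -I tumor.bam \
            -targetIntervals realign_targets.intervals \
            -nWayOut .realigned.bam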

    -Sheila

  • MattB (Newcastle; Member)

    I'm curious about the small fraction of singletons that people reported missing. Is this still the case with the GVCF workflow and the current version of GATK? When you say "which is reduced by the availability of high quality sequencing data": if I had, say, exomes with 200-fold depth, would singletons with good coverage be less likely to be missed, i.e. is it only poorly covered singletons that I could expect to lose? Additionally, has the ExAC project shed any further light on this?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @MattB,

    In principle, confidently called singletons are unlikely to get dropped, regardless of cohort size. Problems occur, as you expect, when coverage is very low and the initial call confidence is correspondingly low. So the more depth you have the better (up to a point), but it's not just absolute depth that counts; it's how well that depth is distributed across the genome or the exome targets. You can have a mean coverage of 200x yet still have targets that are almost entirely uncovered due to various technical difficulties.
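
    One way to check how that depth is actually distributed, rather than relying on the mean alone (a sketch assuming GATK 3.x syntax; the interval list is a placeholder for your exome targets):

        # Summarize per-target coverage so poorly covered targets stand out
        java -jar GenomeAnalysisTK.jar -T DepthOfCoverage \
            -R reference.fasta \
            -I sample.bam \
            -L exome_targets.interval_list \
            -o sample_coverage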

    The ExAC project has definitely generated a lot of insight into this sort of thing but we don't have any numbers that we can share. I believe there's a paper in preparation from the MacArthur lab that will cover some of this.

  • MattB (Newcastle; Member)

    Thanks that's very helpful to know, I look forward to the ExAC paper!

  • @Geraldine_VdAuwera ,

    Thank you for this article and thank you for clearly describing the scenario on how this Joint Genotyping Analysis works. Yes, the mean depth of coverage could be 200, but some sites may end up with only a few reads.

    Scenario A: in Sample A, a specific singleton mutation has poor coverage (for example, 10 reads).
    Scenario B: in Sample B, a specific singleton mutation has good coverage (for example, 100 reads).

    In scenario A, joint genotyping may not make the variant call, as the variant is observed in only one sample and the coverage is poor; whereas in scenario B it would make the call even though the variant is observed in only one sample, because the coverage is good.

    We are a lab with an expected sample volume of no more than 5 at a time, so our approach is to run the variant calling step (-ERC GVCF mode) on individual samples and then run the GenotypeGVCFs step on individual samples separately as well. In the future, when our sample volume increases to 30 or more, we can change the GenotypeGVCFs step to run on all samples together.

    Even though our current sample volume is small and we will be running the GenotypeGVCFs step on individual samples, I still see the benefit of this workflow, as it creates the intermediate .g.vcf file that contains the homozygous-reference information, i.e. the reference-block records.

    Please let me know whether my understanding is correct.

    Thanks,
    mglclinical

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @mglclinical Yes, your understanding is correct.

    Note that when you get to larger cohort sizes (thousands of samples and above), it may be helpful to lower the threshold of the QUAL pre-filtering that GenotypeGVCFs applies by default, in order to capture low-confidence singletons. It seems that, contrary to what we previously expected, the multisample calling algorithm applies a small penalty to singletons when the population size is very large. The effect is not noticeable in small cohorts, but as cohort size grows, it is possible that where the coverage or data quality is low, the QUAL of a singleton may be pushed below the threshold. Lowering -stand_emit_conf and -stand_call_conf should be sufficient to recover those calls.
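
    As a sketch of what lowering those thresholds might look like (GATK 3.x syntax as referenced in this comment; the value 10.0 is purely illustrative, and the combined GVCF name is a placeholder):

        java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
            -R reference.fasta \
            --variant cohort.g.vcf \
            -stand_emit_conf 10.0 \
            -stand_call_conf 10.0 \
            -o cohort.vcf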

  • willpitchers (Michigan State U; Member)

    I wonder if I could get some guidance on this issue... If I use the HaplotypeCaller tool, followed by GenotypeGVCFs, am I making assumptions about my (non-model) data based on the expected properties of human sequence?

    I'm trying to build a .vcf file representing 64 (diploid, highly heterozygous) fish, from data with a coverage of ~10x, with the intent of searching for variants associated with a case/control phenotype. We're concerned that our low coverage and the population structure in the data may be so divergent from the properties GATK 'expects' of human data that we cannot trust our results. Are there any known issues with trying to apply the 'best practice' workflow to data from non-model organisms?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    There are a few technical obstacles that arise if you don't have a prior collection of known variants, as is often the case for non-model organisms. For example, the recalibration tools require known variants (though these can be bootstrapped for BQSR). For variant filtering, you'll need to use hard filters, as the machine-learning tools won't work on your data.
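
    For example, a hard-filtering sketch using VariantFiltration (GATK 3.x syntax; the expression uses the generic starting thresholds suggested in the GATK documentation for SNPs, which should be tuned for your organism):

        java -jar GenomeAnalysisTK.jar -T VariantFiltration \
            -R reference.fasta \
            -V raw_snps.vcf \
            --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
            --filterName "basic_snp_filter" \
            -o filtered_snps.vcf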

    Besides that, there's not really anything assumed that would be human-specific. The GVCF workflow does assume that variants are more likely to be real if they're observed in more than one sample -- but given your experimental design I don't think that would be a problem. You're looking for variants that are shared within subpopulations, right?

    The low coverage is probably the biggest weakness; you may want to do some tests comparing sensitivity between traditional multi-sample calling (giving all samples simultaneously to the HaplotypeCaller) and the two-step GVCF workflow.
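
    For that comparison, the two modes differ only in how the HaplotypeCaller is invoked (a sketch in GATK 3.x syntax with placeholder file names; add one -I per sample):

        # Traditional multi-sample calling: all BAMs in one HaplotypeCaller run
        java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
            -R reference.fasta \
            -I fish01.bam -I fish02.bam \
            -o cohort_direct.vcf

        # Two-step GVCF workflow: one GVCF per sample, then joint genotyping
        java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
            -R reference.fasta -I fish01.bam -ERC GVCF -o fish01.g.vcf
        java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
            -R reference.fasta \
            --variant fish01.g.vcf --variant fish02.g.vcf \
            -o cohort_gvcf.vcf
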
  • sirian (Member)

    I wonder if it makes sense to do joint calling over samples that are from different gene panels (not entirely different, but containing probes at different locations of the same genes, and a few different genes).
    Thanks.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @sirian, in general this is not an obstacle to joint calling as such, although it introduces some limitations. In such cases we prefer to restrict our analysis to the intersection of all interval lists, but you can also choose to run your analysis on the union of all interval lists if you want to maximize the territory you can examine. The main thing to keep in mind is that your analysis will be most powerful in regions that are covered in all samples/panel designs, and weakest in those that are unique to one design, especially if any sample is the only one produced by a particular design. In the joint callset this will be reflected as no-calls ("./." genotypes) at sites/samples where data is unavailable due to lack of coverage.
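
    A sketch of restricting joint genotyping to the intersection of two panel designs (GATK 3.x syntax; interval files and GVCF names are placeholders):

        java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
            -R reference.fasta \
            --variant panelA_sample.g.vcf \
            --variant panelB_sample.g.vcf \
            -L panelA.interval_list \
            -L panelB.interval_list \
            --interval_set_rule INTERSECTION \
            -o joint.vcf

    Switching --interval_set_rule to UNION maximizes the examined territory instead, with the caveats about unevenly covered regions described above.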

  • J_R (NC; Member)

    I'm sure this can be found somewhere and I just haven't tracked it down, but I'd like to know more about the mathematical underpinnings of how the joint genotyping workflow works and how it differs from HaplotypeCaller in ordinary VCF mode. Is there a more detailed description anywhere than these articles (https://software.broadinstitute.org/gatk/guide/article?id=3893, http://gatkforums.broadinstitute.org/gatk/discussion/4150/)?
    Thanks!
    Joanna

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @J_R
    Hi Joanna,

    We do have some basic documents in preparation, but they are not ready for publishing. The best thing to do right now is look at the code.

    -Sheila

  • J_R (NC; Member)

    Hi Sheila,

    Great! Thanks so much.

    ~Joanna

  • ltw (Poland; Member)

    Hello,

    Maybe it's a silly question, but I need a sanity check on this: the title of the first section says "Together is (almost always) better than alone", which implies that the section compares:

    a) variant discovery performed on single samples

    and

    variant discovery performed on multiple samples with one of the following methods:

    b) joint variant calling (BAMs --> VCF)
    c) variant calling on single samples followed by joint genotyping (multiple BAMs --> multiple gVCFs, multiple gVCFs --> VCF)

    But the subsection "Why 'almost always'?" follows the subsection about option c), and it refers to option c) as the "new method" (from what I understood, the "old method" is b)):

    (..) some people have reported missing a small fraction of singletons (variants that are unique to individual samples) when using the new method.

    It made me think: does the problem of missing a small fraction of singletons appear in b) and c) but not in a), or does it appear in c) but not in b)?

    The problem of missing a small fraction of singletons seems important to me, as what I am after are rare variants.

  • shlee (Cambridge; Member, Broadie, Moderator)

    Hi @ltw,

    The experts are away and when they return they may add to my comment. It seems to me that if rare variants are important to you, then you should test for your data whether the workflows we describe above confidently call these rare variants or not.

  • ltw (Poland; Member)

    Thank you for your response, @shlee. I'm looking forward to hearing from the experts; meanwhile, I'll run some tests.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @ltw
    Hi,

    If the quality score is high for the singletons, they should not be dropped. The issue with singletons occurs when the quality score is low. Please note that the issue of singletons being dropped will be addressed in the new GATK4.

    I think this thread will give some tips on how to avoid any singletons being dropped.

    -Sheila

    P.S. Let us know how your tests go!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    See also the release notes for 3.7, specifically the section about the new QUAL model, which is opt-in. We think it addresses the singleton-dropping problem.
  • Juls (Member)

    @Sheila @Geraldine_VdAuwera

    Hi,

    I have two experiments:
    (1) I have a number of cell lines that are all related, originally deriving from the same organism. We still expect them to be quite different and to show a large number of unique variants, as these cell lines are very unstable.
    (2) Different but related species, called against a common reference. Again, we expect a large number of unique variants.
    After variant detection, I would like to compare the samples with each other.
    Since I am interested not only in common variants but also (and most importantly) in singletons -- what is unique to the cell line / species -- I don't think combining is the right thing for me here, according to your comment:

    "Why "almost always"? Because some people have reported missing a small fraction of singletons (variants that are unique to individual samples) when using the new method.

    In addition, the samples are not from a cohort -- they are quite different, as outlined above.
    So my question now is: are my analysis steps OK?
    Since there are no known SNPs for these organisms, I perform variant calling, apply hard filtering, and recalibrate with the result. Then, after the next variant detection round, I apply hard filtering again (I call a regular VCF, not a GVCF). Then I merge the calls with the VCF from a structural variant caller and feed them into snpEff.
    However, now I would like to compare the samples and further process them in R using a multi-sample VCF. I read this article: http://gatkforums.broadinstitute.org/gatk/discussion/53/combining-variants-from-different-files-into-one
    Since none of the options fits my problem, I guess I should use CombineVariants here. Is this correct?
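
    If CombineVariants does turn out to be the right fit, a minimal merge sketch (GATK 3.x syntax; file names are placeholders, and UNIQUIFY renames colliding sample names so that all genotype columns are kept):

        java -jar GenomeAnalysisTK.jar -T CombineVariants \
            -R reference.fasta \
            --variant sample1.vcf \
            --variant sample2.vcf \
            -genotypeMergeOptions UNIQUIFY \
            -o combined.vcf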

    I appreciate any advice! Thanks,
    Julia

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @Juls
    Hi Julia,

    I am a little confused. For your two experiments, do you have one VCF for each sample? And now, you want to combine the single sample VCFs into one VCF?

    Am I correct in assuming you want to compare all the samples from your experiments (case 1 and case 2)? In that case, I think it would be best to try calling variants on all your samples together, both in normal mode and in GVCF mode. Then you can test which method works best for your case. It is better to call variants on all the samples you want to compare together than to call variants on each sample separately and then combine the single-sample VCFs.

    As for missing singletons in the GVCF workflow, you can try using the --useNewAFCalculator argument, which will help with the lost singletons. You can read more about it here.
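
    A sketch of adding that flag to the joint genotyping step (GATK 3.7+ syntax; file names are placeholders):

        java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
            -R reference.fasta \
            --variant sample1.g.vcf --variant sample2.g.vcf \
            --useNewAFCalculator \
            -o joint.vcf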

    -Sheila

  • Juls (Member)

    Hi @Sheila

    Thank you very much for your answer. Yes, I have one VCF per sample. And yes, I want to compare the samples from my experiments, but I am particularly interested in the singletons. I will give --useNewAFCalculator a try.
    Two more questions though:
    1) I do not have a cohort in either case -- the samples are very different, and I have very few of them (around 15 in case 1 and only 2 in case 2). That was originally why I thought joint genotyping wouldn't make much sense; I figured it was meant for a large number of similar samples.
    2) As I mentioned above, I do two rounds of calling since I don't have any known SNPs: I perform variant calling, apply hard filtering, and recalibrate with the result. For this first round it should not make a difference whether I call in normal mode or in GVCF mode, is that correct?

    Thanks again for your advice!
    Best,
    Julia

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @Juls
    Hi Julia,

    1) We recommend the joint genotyping for all samples you would like to compare/analyze together. So, if you are interested in comparing genotypes over all 17 samples, you should use joint genotyping. We did develop the GVCF workflow with large cohorts in mind, but it is fine to use with smaller cohorts as well.

    2) Yes, that is correct. For the purpose of base recalibration, you can choose whichever method you prefer/is fastest.

    -Sheila

  • Juls (Member)

    @Sheila Thank you for your advice!!!

  • jpflorido (Seville; Member)

    Hi,
    Reading this post and the discussion here, it is not clear to me whether I should analyze my samples alone or together.
    In my case, from time to time I'll have a set of unrelated samples from the same sequencing run (unrelated individuals, each one having a different disease) but coming from the same gene panel (i.e., the panel comprises genes from different diseases). Should I run the analysis in two separate steps (per-sample calling followed by joint genotyping across samples)? Or should I run an independent analysis for each sample (i.e., no joint genotyping)?
    "Cohort" sounds to me like it refers to related samples, but maybe I'm confused about the concept here.
    Thanks!
    Javier

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @jpflorido
    Hi Javier,

    Are you planning to analyze the samples all together? "Cohort" basically means the group of samples you plan to analyze together. I would say that if you are interested in the three benefits listed above, you should run the joint genotyping workflow.

    -Sheila
