Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

meaning of 'unified' in UnifiedGenotyper

hrbigelowhrbigelow San FranciscoPosts: 6Member
edited October 2013 in Ask the team

Hi,

The technical docs in UnifiedGenotyper state:

The GATK Unified Genotyper is a multiple-sample, technology-aware SNP and indel caller. It uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples, emitting an accurate posterior probability of there being a segregating variant allele at each locus as well as for the genotype of each sample.

When run with multiple -I inputs, in what ways does the UG algorithm utilize the multiple samples? Does it merely assess each locus in each sample based on the pileup there (and possibly window around it) in isolation, or does it somehow take into account the other samples? Does running it with multiple samples enable it to somehow find variants in some samples that it wouldn't otherwise find, if run on each sample individually?

While I do realize that running in multisample mode allows it to output a nice multi-column VCF format of integrated sample information, is this anything more than just a formatting convenience, the same result that you would achieve if you did the following:

  1. run N individual runs of UG, each with just a single sample
  2. collect the union of all loci that have a variant in at least one sample
  3. re-run UG on each of the N individual samples, and ask it to output calls for the union of the loci in step 2.
  4. merge the results into a single, multicolumn VCF file.

Thanks,

Henry

Post edited by hrbigelow on

Answers

  • ebanksebanks Posts: 671GSA Member mod
    edited October 2013

    Hi Henry,

    The reasons we advocate for running with multiple samples are:

    1. Consistent allele discovery among samples
    2. More empowered filtering by seeing all the data together
    3. Obtaining reference likelihoods for non-variant samples
    4. Accumulating enough confidence at low coverage sites to call positions that would otherwise not be callable for just a single sample.

    Your proposal covers the first 3 reasons and would be okay if you could be sure that your single sample bams all uniformly had deep enough coverage at your regions of interest - or if you don't care about calls at lower coverage regions. But isn't it just easier to call all of your samples together initially?

    Also, FYI: it is called the "Unified" Genotyper because it unified the approaches of 3 of its (now obsolete) predecessors into a single tool.

    Post edited by Geraldine_VdAuwera on

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • hrbigelowhrbigelow San FranciscoPosts: 6Member

    Hi Eric,

    Thanks for your response. Let me make sure I understand what you mean:

    1. Consistent allele discovery among samples

    I suppose by 'consistent', you mean "provide calls for the same set of loci for all samples, even if different subsets of those loci in different samples are not variant". If that's what you mean, then yes, it can be achieved by doing the second-pass genotyping using the specified union of all loci.

    2. More empowered filtering by seeing all the data together

    This sounds like just the "formatting convenience" (i.e. putting the calls of all samples in one line of a single VCF file), right?

    3. Obtaining reference likelihoods for non-variant samples

    This sort of overlaps with objective 1, if I understand what you mean in 1.

    4. Accumulating enough confidence at low coverage sites to call positions that would otherwise not be callable for just a single sample.

    Here I don't understand what you mean, or if I do, i disagree. And, this is the central question. If a single sample, "A" has too low of a coverage at a given locus for its genotype to be callable confidently, then how could you assert that other (possibly independent) samples ("B", "C", "D", ...) could somehow enable the sample A to be callable? The only way I can see this as being plausible is if all samples were used in a general way to somehow derive a background probability model that altered the way the confidence scores were calculated. (But there is no mention of that sort of calculation in the documentation)

    To answer your question, is it just easier to call all samples initially? There are some practical problems with this approach. First, right now, since I am writing from Amgen, I don't have access (because we haven't yet committed to buying a licence) to the most recent GATK, and the version I am using crashes in one of a few different ways (Java GC error being one) when I try to run this on the 113 samples that I have. Secondly, (though not really that important) even if this weren't true, it would mean that adding samples would trigger doing this expensive calculation all over again. Of course, the approach I outline suffers from this inconvenience as well.

    Thanks again,

    Henry

  • ebanksebanks Posts: 671GSA Member mod

    Hi Henry,

    Actually, you are off on some of those points (probably my fault for not explaining sufficiently). However, this is starting to drift outside the scope of this forum (and as a commercial user you should be going through Appistry) so forgive me if I try to keep this quick.

    1. By consistent allele discovery I mean that you are using a consistent set of alleles across all of your samples. This is more a factor for indels where the actual variation can be tricky to determine and occasionally only a subset of your samples show the true variant(s) from the alignments.

    2. Empowered statistical filtering has nothing at all to do with "formatting convenience". What I mean is that when you see consistent artifacts across multiple samples you are more empowered to tell that it is truly a systematic problem (and should be filtered out) than you could be by seeing it just once.

    3. Note that I said that the SITE could be called confidently, not that the SAMPLE could. Given that, the calculation you allude to is absolutely specified in the documentation: it's the "Exact" allele frequency calculation model and is the core of the tool. In it the independent genotype likelihoods are aggregated to create a confidence score for the site's being polymorphic. So it is 100% true that a site where none of the samples could be called confidently individually could itself be called confidently when called over all samples.

    I hope this helps.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • hrbigelowhrbigelow San FranciscoPosts: 6Member

    Hi Eric,

    Thanks for your response. Point 3 is certainly helpful.

    Not sure why you are saying it is outside of the scope? I thought this forum was to help users better understand the tools.

    This post and response from Geraldine seem to suggest that the choice of running UG in single-sample mode vs. multiple sample mode depends on how the various samples are related.

    @Jack said: Thanks, Carneiro, after rerunning my data with the RG tag modified, i finally solved the problem. Here is another question,I have now 10 recalibrated bam files of ten different breeds of chicken, should I call snp by multiple variant calling or call snp individually (sample by sample) ? I got confused because I had read the sentence "It uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples" in the technical documentation page, by saying "in a population of N samples", does it necessarily mean that the N samples belong to a breed ? You know, since in my experiment, each breed only have one sample, so I wonder whether I should use the multiple variant calling method.

    @Geraldine_VdAuwera said: Hi Jack,

    The advantage of running in multi-sample mode is that if there are rare variants that are shared in your cohort, they are more likely to be called rather than dismissed as artifacts. The downside is that if your cohort is very diverse (as may be the case if each sample is from a different breed) the opposite effect will occur: rare variants that are not shared will be less likely to be called. In this case I would not call the samples together, but it's up to you. You may want to test the differences in calls that you get from calling together or separately, on a subset of the genome perhaps.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Hi Henry,

    I think what Eric means by "drifting outside the scope" is that the forum is primarily intended to help users run the tools, while this conversation is turning into a debate on the finer details of the underlying method. Which is not a bad thing as such, since understanding the methods is key to applying them optimally to your data. But unfortunately our resources are constrained and we don't have as much time as we'd like to discuss this in detail with users. If you're coming to our workshop on 21-22 Oct we'd be happy to discuss further in person.

    Geraldine Van der Auwera, PhD

  • hrbigelowhrbigelow San FranciscoPosts: 6Member

    Hi Geraldine,

    I certainly understand and sympathize with the burden of responding to a large and curious community. And, thank you for your responses so far.

    So, I apologize if this is burdensome, but in the chance that you get time to answer it, worth a shot.

    I saw in this post of a few weeks ago, in response to a similar question from @armen

    @Geraldine_VdAuwera said: Hi armen, ... 3) We currently don't provide detailed documentation for the internal mathematics of either caller. For now we recommend that you look at the code if you want to know exactly how they work. I can point you to the relevant classes if that's something you would be interested in.

    Would you consider amending your article http://www.broadinstitute.org/gatk/guide/article?id=1237 to expound on the details of joint variant estimate, and when it is appropriate to consider samples together vs. separate? For example, multiple timepoints each from several patients. Should samples from each patient be called together, but samples from different patients called separately? I feel an accurate mathematical description would implicitly answer that question from base principles and help users reason about it.

    I get the sense from talking to colleagues and reading posts, that I am not the only one confused as to when and why to consider groups of samples together for joint variant estimation. I'm sorry if this is again a burdensome question, but I feel that better documentation would eliminate lots of posts along these lines.

    Alternatively, if you could point me to the relevant classes, that would be great.

    Thanks again

    Henry

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,230Administrator, GSA Member admin

    Hi Henry,

    Sorry I forgot to answer you in this thread.

    You are right that this is a frequently requested topic, and we have a scheduled task to produce a document on single vs. multisample calling.

    In the meantime, my recommendation would be to start from the UnifiedGenotyperEngine.java class and explore from there. Are you familiar with java?

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.