

# Best practice for using UnifiedGenotyper to jointly call multiple genomes in a case-control project

Redwood City, CA · Member Posts: 11
edited August 2013

Hi, GATK team

I have a project with hundreds of WGS genomes from unrelated case and control samples. I want to use the joint calling function in UnifiedGenotyper to call all these genomes together and run VQSR afterwards. Here are my two questions:

1. Is there a best practice for this type of experimental design? Obviously, I cannot use the PED file to assign case and control status because the samples are unrelated. Is there a way of supplying that phenotypic data to the joint caller, just like the PED file in a family study setting?

2. If there are several phenotypic statuses based on the severity of the disease instead of just case and control, is there a way to assign a status such as extreme-case, mild-case, control, or extreme-control? And what is the impact of giving graded case-control statuses vs. a Boolean case-control status?

Thank you very much!!

Post edited by Geraldine_VdAuwera on

The GATK currently does not have the capability to handle phenotypical information in this way, sorry. In any case this sounds like something you'd want to work into the analysis you do on the variants after the calling step.

Geraldine Van der Auwera, PhD
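
As a rough illustration of what "after the calling step" could look like, here is a minimal Python sketch that tabulates carriers of one variant by phenotype and runs a two-sided Fisher's exact test. It is not part of GATK: the function names, the `"case"` label, and the genotype-string inputs are assumptions made for the example, and a real association analysis would normally use a dedicated statistics package and correct for multiple testing.

```python
from math import comb


def carrier_table(genotypes, phenotypes):
    """Count carriers/non-carriers of a variant among cases and controls.

    genotypes:  dict of sample name -> VCF-style GT string, e.g. "0/1" (hypothetical input)
    phenotypes: dict of sample name -> "case" or anything else (treated as control)
    Returns (case_carriers, case_noncarriers, control_carriers, control_noncarriers).
    """
    a = b = c = d = 0
    for sample, gt in genotypes.items():
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # skip no-calls rather than counting them as reference
        carrier = any(allele != "0" for allele in alleles)
        if phenotypes[sample] == "case":
            if carrier:
                a += 1
            else:
                b += 1
        else:
            if carrier:
                c += 1
            else:
                d += 1
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    cases, controls = a + b, c + d
    carriers = a + c
    denom = comb(cases + controls, carriers)

    def prob(x):
        # Hypergeometric probability of seeing x carriers among the cases
        return comb(cases, x) * comb(controls, carriers - x) / denom

    p_obs = prob(a)
    lo = max(0, carriers - controls)
    hi = min(cases, carriers)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

Graded phenotypes (extreme-case, mild-case, and so on) would need a different test, such as a trend test or regression, but the principle is the same: the phenotype labels enter the analysis after genotypes have been called.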

• Redwood City, CA · Member Posts: 11

Thanks! That is fine. I know it is hard to integrate a graded phenotype into variant calling. I will do downstream analysis to figure things out.
However, is there a best practice for joint calling over unrelated samples if we only have a Boolean (affected vs. unaffected) phenotype? I can understand that, in principle, joint calling will increase accuracy over related samples (family, tumor/normal, etc.). Do you think unrelated case-control samples will benefit from joint calling compared to calling them individually?

Thanks again!

Joint calling will improve discovery power for rare variants that are shared in your cohort. But it may diminish your ability to identify rare variants that are not shared in the cohort. So ultimately it depends on what you are looking for, exactly. If you expect that the variants of interest will be shared among some of the individuals that share phenotypical properties, then I would call those (and corresponding controls) together, yes. Keep in mind that calling groups of samples (30+) together also gives you the added benefit of being able to use variant recalibration (VQSR) on them, which is usually not possible (or is severely underpowered) on individually called samples. Does that help clarify things?

Geraldine Van der Auwera, PhD
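
For readers who have not run a multi-sample call before, the sketch below assembles a joint-calling command line in Python, assuming the GATK 3.x conventions of selecting the tool with `-T` and passing one `-I` per BAM. The helper name, the jar path, and all file names are placeholders invented for the example; the function only builds the argument list and does not run anything.

```python
def joint_calling_command(reference, bams, output_vcf,
                          tool="UnifiedGenotyper",
                          jar="GenomeAnalysisTK.jar"):
    """Build a GATK 3.x-style argument list that calls all BAMs together.

    reference, bams, output_vcf and jar are hypothetical paths; tool can be
    switched to "HaplotypeCaller" for the same multi-sample layout.
    """
    if not bams:
        raise ValueError("at least one BAM file is required")
    cmd = ["java", "-jar", jar, "-T", tool, "-R", reference]
    for bam in bams:
        cmd += ["-I", bam]  # one -I per sample: cases and controls go in together
    cmd += ["-o", output_vcf]
    return cmd


# Example with placeholder file names: two cases and two controls in one call
command = joint_calling_command(
    "reference.fasta",
    ["case1.bam", "case2.bam", "control1.bam", "control2.bam"],
    "cohort.vcf",
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run` once the paths point at real files. With hundreds of genomes you would typically also split the work by genomic interval, which is left out here.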

• Redwood City, CA · Member Posts: 11

So your suggestion is to call the cases together, call the controls together separately, and do the comparison downstream of variant calling. Is that correct?

• Redwood City, CA · Member Posts: 11

Does it hurt to call all cases and controls together if I have the computing power? Thanks a lot!

No, I would definitely call cases and controls together, so that you can filter them together. What I meant is that if you have case-control groups for different categories of phenotypes, you can separate those groups, e.g. call the "eye-color" group separately from the "shoe-size" group and so on.

Geraldine Van der Auwera, PhD

• Redwood City, CA · Member Posts: 11

It is very clear now. Thanks!

• Member Posts: 18

Hi,

1) Do the above hold for HaplotypeCaller too?

2) Assume that I have multiple case cohorts (each with a different disease) and a control cohort, all from the same population, and I want to find statistical associations between variants and a disease/non-diseased Boolean variable. Should I still call all samples together?

3) Could you please point me to the most detailed mathematical description available for each variant caller, so I can better understand the ramifications of calling cases and controls together vs. calling them separately?

Hi @armen,

1) Yes, the above discussion of single vs. multi-sample calling applies equally to UG and HC.

2) Yes, assuming there are multiple samples in each cohort and the relevant mutations will be shared among them.

3) We currently don't provide detailed documentation for the internal mathematics of either caller. For now we recommend that you look at the code if you want to know exactly how they work. I can point you to the relevant classes if that's something you would be interested in.

Geraldine Van der Auwera, PhD

• Member Posts: 18
edited September 2013

Thank you @Geraldine_VdAuwera.

About 3), I'll check the UG and HC classes out for more information on their variant calling process.

I'm a bit confused about 2) though.

Regarding the first assumption ("there are multiple samples in each cohort"): By definition, there are multiple samples in each cohort, otherwise it wouldn't be a cohort. However, assume that I have a single case. What would be the problem?

Regarding the second assumption ("the relevant mutations will be shared among them"): "relevant" in this situation would mean relevant to any of the diseases. If some mutations are relevant to a single disease or only a few diseases, i.e., they are not shared among diseases, would calling all samples from all cohorts together diminish my ability to call these mutations, compared to grouping the samples by cohort and calling each group separately?

Part of my confusion results from the apparent contradiction of this thread with another one. Assume that variant calling is performed on a single sample and a certain variant is assigned a score. Further assume that variant calling is also performed on the whole cohort where the sample comes from. I understand that the appearance of the variant in additional samples in the cohort increases its single-sample score (http://gatkforums.broadinstitute.org/discussion/2218/benefits-of-running-unifiedgenotyper-on-multiple-samples-at-the-same-time). However, does the lack of appearance of the variant in additional samples in the cohort decrease its single-sample score (as implied from your second comment in this thread) or leave it as it is (as implied in ebank's reply to evakoe's comment in this thread: http://gatkforums.broadinstitute.org/discussion/1693/multisample-calling-with-gatk-in-diverse-ethnic-populations)?

@armen said:
Regarding the first assumption ("there are multiple samples in each cohort"): By definition, there are multiple samples in each cohort, otherwise it wouldn't be a cohort. However, assume that I have a single case. What would be the problem?

Hah, fair point. Bad phrasing on my end. The first part isn't so much a separate assumption as setting the context for the second statement, i.e. that any mutations of interest would be shared among multiple samples in the cohort if there are multiple samples. If N=1 then the assumption is unnecessary.

By relevant mutation I mean relevant to the disease that affects a particular cohort. If the diseases and corresponding mutations that affect the various cohorts are distinct, i.e. non-overlapping, then it is unnecessary to call them all together. If the cohorts are large enough on their own (i.e. will yield enough calls to be processed with VQSR) you would typically process them separately, to save on computing resources. However, if the cohorts are too small (which is going to be a problem downstream for VQSR) then it makes sense to lump them together for calling in order to get enough "mass" to empower VQSR.

As far as I know, lack of appearance in other samples won't decrease a variant's single-sample score as such, but because certain annotations are relative rather than absolute, that rare variant's ranking may be negatively affected relative to other, more prevalent ones. If that rare variant gives a strong signal, that may not matter, but if for whatever reason the call is borderline (because mapping or base quals at that site are bad, for example) then its signal gets drowned out a little. This is really only a potential concern for something that only shows up in a very small proportion of samples though.

Geraldine Van der Auwera, PhD
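
The pooling rule of thumb from the reply above can be sketched as a small planning function: phenotype groups that are large enough for VQSR on their own are called separately, and the small ones are lumped into one batch. The threshold of 30 is taken from the "30+" mentioned earlier in the thread and is a guideline rather than a hard limit; the function and cohort names are hypothetical, and each "cohort" here means the cases plus matching controls for one phenotype.

```python
MIN_SAMPLES_FOR_VQSR = 30  # rule of thumb from the thread, not a GATK constant


def plan_calling_batches(cohort_sizes, min_samples=MIN_SAMPLES_FOR_VQSR):
    """Decide which cohorts to call alone and which to pool.

    cohort_sizes: dict of cohort name -> number of samples (cases + controls).
    Returns a list of (cohort_names, total_samples) tuples, one per calling batch.
    """
    batches = []
    pooled_names = []
    pooled_total = 0
    for name, size in cohort_sizes.items():
        if size >= min_samples:
            # Large enough to yield sufficient calls for VQSR on its own
            batches.append(([name], size))
        else:
            # Too small: pool with the other small cohorts
            pooled_names.append(name)
            pooled_total += size
    if pooled_names:
        batches.append((pooled_names, pooled_total))
    return batches


# Example with made-up cohort sizes
for names, total in plan_calling_batches({"eye-color": 120, "shoe-size": 12, "height": 25}):
    print(names, total)
```

If the pooled batch is still below the threshold, the plan alone does not fix that; extra technically matched samples would be needed, as discussed further down the thread.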

• UK · Member Posts: 24

@Geraldine_VdAuwera said:
No I would definitely call cases and controls together, so that you can filter them together. What I meant is that if you have case-control groups for different categories of phenotypes, you can separate those groups, e.g. call the "eye-color" group separately from the "shoe-size" group and so on.

@Geraldine_VdAuwera said:
However, if the cohorts are too small (which is going to be a problem downstream for VQSR) then it makes sense to lump them together for calling in order to get enough "mass" to empower VQSR.

Then, am I correct to assume that joint calling is recommended if the groups for different phenotypes are small? But if my aim is to identify the mutation that is responsible for each phenotype, should I still do the same? I think this will diminish the power to detect such rare mutations.

In terms of the additional samples that are used to make up the number (30+) to empower VQSR, is it better to use 1000G exomes or our in-house exomes (usually with different diseases, and some of them are related), assuming that population and sequencing platforms are matched with either of the choices? When there is only 1 exome of a disease, do you think adding another 29 samples is a good idea (let's ignore the computing resources for now)? Will this increase the chance of losing the rare mutation for the disease? Would you recommend using hard filtering in this case?

Sorry for so many questions. I guess the ultimate question is whether joint calling and adding additional exomes is always good practice if the aim is to identify rare mutations for a disease and the number of exomes for each disease is very small (N=1 or 2). Furthermore, because exome data keeps coming into our lab, we have plenty of choices for which to use as those additional exomes. What would be the extra criteria other than population, sequencing platform and exome capture kit used, such as whether the exome is from a healthy individual, whether the samples are related, and whether the sequencing depth is good?

Many thanks,

The expectation is that even for those samples with unique/rare mutations, you will still get better results calling enough samples together so that you can run VQSR, compared to calling samples individually and applying hard filters. Ultimately that's not really because multisample calling is so great for this application, but because hard filters suck more.

In choosing which exomes to lump together, if they are not already part of a predesigned cohort, the technical properties (sequencing platform, capture kit, target coverage) should probably be the first thing you look at. The reason is that you want to ensure that any technical biases are common to all the exomes (which will help keep the model consistent). For the rest, we haven't really done systematic testing, so I can't give you a definitive answer, sorry.

Geraldine Van der Auwera, PhD
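
One way to apply the "technical properties first" advice when picking extra exomes is to bucket the available samples by platform, capture kit and target coverage, then pick padding samples from the same bucket as the sample of interest. The sketch below assumes a simple list of per-sample metadata dictionaries; the field names and example values are invented for illustration and would need to match however your lab tracks this information.

```python
from collections import defaultdict


def group_by_technical_profile(samples):
    """Group sample names by (platform, capture kit, target coverage).

    samples: list of dicts with hypothetical keys
             "name", "platform", "capture_kit", "target_coverage".
    Returns a dict mapping each technical profile to the sample names sharing it.
    """
    groups = defaultdict(list)
    for sample in samples:
        profile = (sample["platform"], sample["capture_kit"], sample["target_coverage"])
        groups[profile].append(sample["name"])
    return dict(groups)


# Example with made-up metadata
samples = [
    {"name": "patient1", "platform": "HiSeq", "capture_kit": "kitA", "target_coverage": 50},
    {"name": "inhouse7", "platform": "HiSeq", "capture_kit": "kitA", "target_coverage": 50},
    {"name": "inhouse9", "platform": "HiSeq", "capture_kit": "kitB", "target_coverage": 50},
]
for profile, names in group_by_technical_profile(samples).items():
    print(profile, names)
```

Exact matching on coverage is stricter than necessary in practice; binning coverage into ranges would be a reasonable relaxation. Criteria such as relatedness or health status are not covered, since the thread gives no tested guidance on them.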