
Multi-sample calling with GATK in diverse ethnic populations

dg11 Member
edited October 2012 in Ask the GATK team

Hi,
I'm trying to generate a reference panel for imputation from low-coverage WGS data from three different African populations using GATK. Since a reference panel needs complete data at all sites, there are two possible approaches: multi-sample calling across all three populations with GATK, or multi-sample calling per population, followed by separately calling, in each population, the sites found variant in any population, prior to merging. I wasn't sure whether multi-sample calling with GATK across genetically diverse populations might lead to issues, such as reduced calling of rare variants that appear in one population but not the others. Could you clarify this, please?

Thanks.

Best wishes,
Deepti

Best Answer

  • Mark_DePristo Broad Institute admin
    Accepted Answer

    For low-coverage data there's a trade-off between singleton (and even doubleton) discovery efficiency and the number of samples in general. The issue is simply that with more ref samples, you naturally require more evidence for a singleton, as the expected number of errors goes up. Here's a concrete example:

    Suppose a sample has a single non-ref base at Q10, so there's a 0.1 chance that the base is simply a sequencing error. Now I add 10 samples, each with the same error rate. Now the chance that the site is truly non-ref is much lower (close to 0), because across the 10 extra reads from those samples I already expect to see 1 error.
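    Mark's arithmetic can be checked with a quick back-of-the-envelope sketch (hypothetical numbers only, not GATK internals):

    ```python
    # Chance of seeing at least one erroneous non-ref base among
    # n reads, given a Q10 per-base error rate (p_err = 0.1).
    def p_at_least_one_error(n_reads, p_err=0.1):
        return 1 - (1 - p_err) ** n_reads

    # One Q10 read: a lone non-ref base is an error 10% of the time.
    print(round(p_at_least_one_error(1), 2))   # 0.1

    # Ten more reads from ten added samples: a single non-ref
    # observation now arises from error alone ~65% of the time,
    # so the same lone Q10 base is much weaker evidence.
    print(round(p_at_least_one_error(10), 2))  # 0.65
    ```

    This matches the intuition above: with 10 reads at a 10% error rate, the expected number of errors is exactly 1.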

    This problem is worst when you have multiple ethnic groups, as rare variants appear to be much more population-specific than common variation (the 1000G Phase I paper shows this nicely). So adding 1000 samples from Africa to 1000 samples from Europe has the nasty side effect of making it harder to call rare variants in both populations, because the AFR reads count against your EUR data even though the expectation that a singleton in EUR is shared with AFR is low.

    Some people have directly incorporated this population structure into their calling approach. The best one I know of is from the Sanger Institute, where they call each continental group independently, then all samples together, and take the union of the calls, squaring off the likelihood matrix for the union. This workflow is entirely possible with the GATK as well, and if you can manage the informatics headache it's the approach I'd recommend.
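    To make "union of the calls" and "squaring off" concrete, here is a toy sketch of that workflow (site names, sample names, and the placeholder "./." genotype are all made up for illustration; this is not GATK code):

    ```python
    # Step 1: per-group calling produces one variant site list per group.
    calls_afr = {"chr1:1000", "chr1:2500"}   # sites variant in the AFR group
    calls_eur = {"chr1:2500", "chr1:9000"}   # sites variant in the EUR group

    # Step 2: take the union of the two call sets.
    union_sites = sorted(calls_afr | calls_eur)

    # Step 3: "square off" the matrix so every sample has a genotype
    # cell at every union site, with no missing entries.
    samples = ["AFR_s1", "EUR_s1"]
    matrix = {site: {s: "./." for s in samples} for site in union_sites}

    # In practice each "./." cell would then be filled by re-genotyping
    # that sample's reads at that site, rather than left missing.
    ```

    The point of the squaring-off step is that a site discovered in only one group still gets an explicit genotype (possibly hom-ref) for every sample in the other group.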

    Best,

    Mark

Answers


  • dg11 Member

    Hi Mark,

    Thanks for clarifying- this is very helpful.

    Best wishes,
    Deepti

  • evakoe Member

    Hello,
    I have a question in a similar direction to Deepti's.

    I have high-coverage (~50x) exome data from about 10 related individuals: 5 have Parkinson's disease and 5 are healthy. (I do not know the exact number of samples yet, but it will be in that range.) I assume that in this case, too, it does not make sense to call variants on all 10 samples together, since, as with different populations, I might miss rare variants in my diseased samples.

    Now, if I divide my samples into two groups, I could add additional samples to the healthy group and perform variant calling and variant recalibration as described in the Best Practices, but for my diseased group I cannot add extra samples. So there I would call variants on the 5 samples and perform variant recalibration, even though 5 is actually too few.

    If I incorporate Mark's suggestion, I could call variants on my 30 healthy samples and my 5 diseased samples individually, then together, and then join the calls.

    @Mark_DePristo, would you also recommend this approach in my case?

    Thank you very much,

    Eva

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    Hi Eva,

    It's actually a bit different for disease studies. In this type of analysis you would want to call the cases and controls together to make sure that there's no variation that's being missed (e.g. due to lower coverage in either cohort) and that could lead to false associations downstream. Presumably your cases and controls are all of the same ethnic population so Mark's issues above shouldn't be a factor.

  • evakoe Member

    Hi Eric,

    thanks a lot for the quick reply. Let's assume I call all my samples together and I have a rare, novel variant in 3 of my 5 cases but in none of my 5 controls. Couldn't it then happen that this variant is identified as a false positive in the recalibration process, even though it truly is a causal variant that is simply hard to detect?
    Furthermore, if I had twice or thrice as many controls as cases, wouldn't this problem be even greater?

    Thank you very much again.

    Eva

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    That shouldn't happen (it's no different from calling cases and controls separately).

  • evakoe Member

    Great, thanks a lot, Eric, for clarifying this.

    Eva

  • mike Member

    Hi,

    I have a family-based germline exome-seq study of a genetic disease. We have about 5-6 families; each family has affected members and healthy family members as controls (e.g., parents, siblings, even grandparents), in general about 4-5 samples per family with at least one affected. However, the families are rather diverse: one family is from India, one is of Hispanic and Native American descent, and the rest are North European Caucasian. Our disease model is a rare autosomal recessive disorder. What would be the best way to call the variants (SNPs and indels)? Based on the above discussion, it seems we should call patients and controls together, but we have multiple families from diverse ethnic groups (Indian, Hispanic, Caucasian) with such small sample sizes.

    Any advice would be highly appreciated!

    Thanks

    Mike

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi Mike,

    Our methods guys suggest adding a few (N=10 should be fine) exomes from each of the overlapping 1000 Genomes populations (corresponding to the ethnic groups) and calling everyone together (so about 100 total samples). I believe that should give you good discovery power without losing resolution due to the ethnic diversity.

  • mike Member

    Hi, Geraldine:

    Thx a lot for getting back to me, and great suggestion. Just want to make sure of a few points:

    1. When you say "adding a few (N=10 should be fine) exomes from each of the overlapping 1000 Genomes populations (corresponding to the ethnic groups)": 1KG just announced that the official release of the phase 3 low-coverage and exome data is complete and available on the FTP site. However, our data were generated on the Illumina HiSeq (100 bp), so I'm not sure what we can get from 1KG. Is there any concern about platform differences here? For Illumina, a mixture of HiSeq and GAIIx is fine, right?

    2. When you say "calling everyone together (so about 100 total samples)", do you mean calling all samples from all ethnic groups (Indian, Hispanic, and Native American) together, or calling each ethnic group separately after adding the corresponding samples from the 1KG project? It sounds like you suggest calling all samples (across multiple ethnic groups) together. If that's the case, I read Mark's comment in this thread: "... The best I know is from the Sanger, where they call in each continental group independently, then all samples together, and take the union of the calls, squaring off the likelihood matrix for the union. ..." I'm not clear on what "take the union of the calls, squaring off the likelihood matrix for the union" means, or why one would do it. Any clarification would be highly appreciated!

    3. An even more complex issue is that some families already have mixed ethnic backgrounds: for example, one family is a mixed Caucasian and African American family, and one is a mixed Hispanic and Native American family. What would be the best way to deal with these ethnic groups?

    Thx again for your great help!

    Best

    Mike

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi Mike,

    1. I wouldn't expect any platform issues there, no.

    2. In your case, simply calling all the samples together should be fine. By adding ~10 samples from each corresponding continental group, you're building a reasonable buffer against the possibility of ethnicity-specific variants getting lost in the crowd. As far as I know, calling first by continental groups is done more with larger sample groups. If you want to find out more about Sanger's approach, I would suggest reading some of their recent papers in that space, since this is beyond the scope of support that we can provide at the moment.

    3. If you call all samples together this problem pretty much goes away.

  • mike Member

    Hi, Geraldine:

    Thx a lot for the suggestions and comments. Appreciated very much!

    Do you happen to have a reference for Sanger's approach? I checked their web site and it was not obvious where to find what you refer to.

    Also, I did check the 1KG data: there are HapMap Gujarati Indian individuals from Texas and Yoruba (YRI) individuals, which could serve as the background for our Indian family and African American family respectively. For Hispanic, the best background groups might be Puerto Ricans in Puerto Rico, HapMap Mexican individuals from LA, California, etc. There are also many Chinese individuals in the 1KG Asian group; is it worth adding those to expand the background dimension? Is there any benefit from a more diverse genetic background, or would it be a distraction from the main theme?

    Thx again for your advice!

    Mike

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi Mike,

    No, unfortunately I can't give you a reference for the Sanger approach -- this is just something that we've discussed informally. I would look at their recent publications for more information.

    For what you're doing I think it can't hurt to add a bit more diversity to pad the background matches. I wouldn't start throwing in Eskimos or Yanomami just for the fun of it, but it can help to hedge your bets, since "Asian" for example encompasses a great many possible origins.

  • mike Member

    Thx a lot! Mike

  • blueskypy Member ✭✭

    @ebanks said:
    Hi Eva,

    It's actually a bit different for disease studies. In this type of analysis you would want to call the cases and controls together to make sure that there's no variation that's being missed (e.g. due to lower coverage in either cohort) and that could lead to false associations downstream. Presumably your cases and controls are all of the same ethnic population so Mark's issues above shouldn't be a factor.

    I'm confused by Eric's suggestion that the cases and controls should be called together. My understanding of the Best Practices is to construct cohorts for variant calling, and that a cohort should consist of similar samples; for example, cases and controls would be two separate cohorts.

    I have the same concern as Eva: that the rare variants in the cases might be missed if cases and controls are called together. But Eric said that won't happen; why?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    If you're combining the same numbers of cases and controls, there will be a balance between absence and presence of variants in the overall cohort, so they won't get missed. Problems arise mostly when you have a very big imbalance within your cohort.

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    But the cases and controls are not two different cohorts. They are a single cohort with different phenotypes. To minimize batch effects I would highly recommend calling them all together if possible.

  • blueskypy Member ✭✭
    edited October 2013

    hi, Geraldine and Eric,
    Thanks for the quick responses! I really appreciate it! So if the cases and controls are NOT different cohorts, what would count as different cohorts in variant calling? In a disease study, how do we determine whether samples should be in one cohort or two?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Diabetic Eskimo men vs. Italian women suffering from Alzheimer's, for example... Ethnically different, different gender, completely different health issues. It's not always that clear-cut but that's the general idea.

  • blueskypy Member ✭✭

    So if the genetic difference between two groups is big, they should be in different cohorts; otherwise, if the difference is small (even if it's a key difference; for example, most diseases are caused by a small number of mutations), they should be in one cohort for variant calling?

    If that understanding is correct, what is the reason for such an arrangement? What's the drawback of separating cases and controls into two cohorts, compared to calling them all in one?

    Sorry to keep bugging you!

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    Answered above

  • blueskypy Member ✭✭

    hi, Eric,
    thanks for the help! So do you mean that if the cases and controls are called separately, the batch effect is going to bury the few mutations that cause the disease? Or at least that there will be more false positives due to the batch effect?

    By the way, what is the batch effect in exome-seq that could affect variant calling?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    @blueskypy, unfortunately we're very busy right now, and we don't have the time to go into further detail on this topic at the moment. This is really going into a level of meta discussion about experimental design that is admittedly very important, but we can't make it a priority. We do have a task scheduled to write a complete documentation article on this topic, so I'd like to ask you to please hold off on further questions until we have the time to get that done. We hope to be able to get to it early next week; whether that works out is going to depend on how fast we can get our current scheduled tasks done and out of the way. I'll let you know when we get this documented, okay?

  • blueskypy Member ✭✭
    edited October 2013

    thanks so much, Geraldine! I hope your doc can address the following:
    1. Does the construction of a cohort depend on 1) the frequency of the disease-causing variants, 2) the number of samples, and 3) the sequencing depth? In our study, we don't know the frequency of the disease-causing variant, but we have >50 samples in both cases and controls, at 20x sequencing depth.
    2. Could you explain what batch effects in exome-seq could affect variant calling? Does it depend on whether the material is cell lines or patient tissue (the latter in our study)?

  • blueskypy Member ✭✭
    edited October 2013

    What's the largest cohort HaplotypeCaller can handle? We have 700 samples in total from cases and controls, including 95% white and 5% Asian, black, etc., all sequenced at 50x coverage.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    In our hands (with pretty decent compute), 100 samples seems to be the upper limit; beyond that it's just too slow...
