The benefit of using a cohort

blueskypy Member ✭✭
edited January 2014 in Ask the GATK team

I am working on a large project and I am stuck on whether I should use a cohort in variant calling or call the samples individually.
Here is my sample info:

  1. Cohort f42: 42 female Caucasians
  2. Among the 42, s1 and s5 are NA12878
  3. All samples were sequenced by the same vendor using the Agilent v4pUTR exome capture kit.
  4. Mean exome sequencing depth: f42: ~80 for each sample; s1: 108; s5: 70
  5. Gold standard: the Genome in a Bottle (GIB) NA12878 call set; the bed file is the intersection of the GIB bed file and the v4pUTR region.

My comparison results on SNPs:

Table 1: cohort vs. individual calling performance
cohort    filter    sample    NRS      NRD      OGC
f42       filter    s5       0.920    0.003    0.997
s5        filter    s5       0.937    0.001    0.999
f42       filter    s1       0.921    0.002    0.998
s1        filter    s1       0.945    0.001    0.999
f42     noFilter    s5       0.932    0.003    0.997
s5      noFilter    s5       0.947    0.001    0.999
f42     noFilter    s1       0.934    0.003    0.997
s1      noFilter    s1       0.951    0.001    0.999

note:
1. NRS: Non-Reference Sensitivity; NRD: Non-Reference Discrepancy; OGC: Overall Genotype Concordance
2. a cohort value of s5 means s5 was called individually (likewise for s1)
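
For reference, the metrics above come from GATK's GenotypeConcordance walker, run roughly like this (file names are placeholders):

    # compare each callset (--eval) against the GIB truth set (--comp),
    # restricted to the intersected bed file (-L)
    java -jar GenomeAnalysisTK.jar \
        -T GenotypeConcordance \
        -R human_g1k_v37.fasta \
        --eval s5.vcf \
        --comp GIB_NA12878.vcf \
        -L gib_v4pUTR_intersect.bed \
        -o s5_concordance.grp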

I do not see a filter option with VariantEval, so I suppose the following results are for all called sites.
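
The VariantEval run was roughly like this (file names are placeholders; --dbsnp is what classifies called sites as known vs. novel):

    java -jar GenomeAnalysisTK.jar \
        -T VariantEval \
        -R human_g1k_v37.fasta \
        --eval s5.vcf \
        --dbsnp dbsnp_137.vcf \
        -L gib_v4pUTR_intersect.bed \
        -o s5.eval.grp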

Table 2. Number of variants and Presence in dbSNP
cohort  sample    variants    novelSites
f42       s5       41,978       48
s5        s5       42,714       47
f42       s1       42,042       77
s1        s1       43,088       68

Table 3. Number of common and unique variants in cohort and individual calling
sample    cohort 1    cohort 2    intersection    unique 1    unique 2
s5         f42          s5          41,284         694          1430
s1         f42          s1          41,653         389          1435

Basically, the three tables suggest the following:
Cohort calling leads to lower calling sensitivity and concordance rate, a higher discrepancy rate, and fewer called variants, although it does find more novel sites.

Did I do something wrong here, or does the benefit of a cohort not apply to samples at 80X depth? The papers demonstrating improved accuracy in multi-sample calling use samples of low depth, such as 4X. Has anyone checked this on deeply sequenced samples?

I appreciate any comments!


Answers

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Hi there,

    I've posted a new FAQ article about joint calling here, and I believe Craig will be in touch shortly to discuss your specific case.

  • blueskypy Member ✭✭

    Thanks so much, Geraldine! Would you please provide larger figures in that post?

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Unfortunately I don't have the originals on hand (those come from a grant application), so I'll try, but it might take some digging...

  • blueskypy Member ✭✭

    hi, Geraldine,
    do you have any suggestions on how to figure out why my cohort run has worse NRS, NRD, and OGC than the individual run?

    The pipeline basically includes mapping, sorting, dup marking, realignment, BQSR, HC, and VQSR for SNPs and indels. Between the cohort and individual runs, all steps are the same except the input to HC: -I f42.list for the cohort run and -I s1.bam for the individual run.
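
    For concreteness, the two HC invocations differ only in -I (reference and output names here are placeholders):

        # cohort run: one multi-sample call over the 42 BAMs listed in f42.list
        java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
            -R human_g1k_v37.fasta -I f42.list -o f42.vcf

        # individual run: the same command on a single BAM
        java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
            -R human_g1k_v37.fasta -I s1.bam -o s1.vcf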

    thanks,

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Hi @blueskypy,

    I've discussed this with Craig. On my end I will look at upgrading the docs for the genotype concordance metrics to clarify what they mean and what they're useful for, and I believe Craig will work with you to interpret what they say about your results.

  • blueskypy Member ✭✭
    edited January 2014

    hi, Geraldine,
    Thanks for the help! Do you mean NRS, NRD, and OGC may not be the right metrics to show the benefit of joint calling over individual calling? If they are not, what would such metrics be?

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Oh, I'm not saying they aren't useful, but it's not strictly a matter of tallying up the numbers and looking at what's higher vs. lower. It also depends on what you're using as a comparison set, how the filtering was done, etc. But that's a discussion beyond the scope of support I can currently provide on this forum (although I would love to document these methods in more detail eventually), which is why I refer you to Craig, who is taking the lead on this in consultation with the methods developers and myself as needed.

  • blueskypy Member ✭✭

    hi, Geraldine,
    I separated the coverage regions into 0-20X, 40-60X, 80-100X, etc. Within 0-20X, joint calling called more variants and has higher sensitivity than individual calling. But from 40X and deeper, it leads to fewer variant and lower sensitivity. Could you explain that?

    Also in the all coverage regions, joint calling always has higher discrepancy rate than individual calling. why? The GIB call set should be very reliable, right?
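
    For reference, here is roughly how the binning can be done: per-locus depth from DepthOfCoverage, an awk pass to turn each depth bin into a bed file, then the concordance evaluation re-run per bin. (File names, the bin shown, and the assumed per-locus column layout are illustrative.)

        # per-locus depth over the capture targets
        java -jar GenomeAnalysisTK.jar -T DepthOfCoverage \
            -R human_g1k_v37.fasta -I s5.bam -L v4pUTR.bed -o s5.cov

        # the per-locus table has "chr:pos" in column 1 and total depth in
        # column 2; emit a bed line for every locus in the 0-20X bin
        awk -F'\t' 'NR > 1 { split($1, a, ":");
                             if ($2 < 20) print a[1] "\t" a[2]-1 "\t" a[2] }' \
            s5.cov > s5.0-20X.bed

        # re-run the concordance evaluation restricted to that bin
        java -jar GenomeAnalysisTK.jar -T GenotypeConcordance \
            -R human_g1k_v37.fasta --eval s5.vcf --comp GIB_NA12878.vcf \
            -L s5.0-20X.bed -o s5.0-20X.grp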

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Re: the >40x coverage regions, have you gone in and checked what the sites look like in the single vs. joint data? If you have evidence that true variants are being missed, we want to know and we'd look into it to see why they're missing. There could be several reasons; HC could be doing the wrong thing in re-assembly, favoring the wrong haplotypes in the tree, etc. Or there's no issue at the calling stage but it's a filtering issue; you could be either under-filtering the single sets or over-filtering the joint set. Impossible to say up front without more information.

    I can't speak to the reliability of the GIB callset. Constructing a high quality truth set is really hard to do (it's an ongoing project for us) so I would certainly not assume any comp callset would be 100% correct. Also, if the GIB set was produced with single-sample calling it would be reasonable to expect more similar error modes between it and a single-called set than it and a joint-called set. Lower discrepancy could just mean making the same mistakes.

    I'm not saying that joint calling is bullet-proof (and we're always looking to improve the HC), but it's important to keep in mind that experimentally derived "truth" sets are inherently problematic...

  • blueskypy Member ✭✭

    hi, Geraldine,
    Thanks for the reply! I'll be on vacation and will look into it again after that! Meanwhile, I wonder if you or anyone else is interested in running a test similar to mine. My previous test using 1KG data also had similar results.

    Best!

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Enjoy your vacation!

  • blueskypy Member ✭✭

    Please see the attached figs! My results show that joint calling leads to higher sensitivity and lower discrepancy ONLY at sites of lower sequencing depths, and the opposite at sites of higher depths.

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Hi @blueskypy,

    Thanks for posting figures. Your main observations fit our expectations quite nicely, since multisample calling mostly aims to compensate for low or missing coverage. It's interesting to see that there is a bit of a drop-off at very high depths, though nothing as dramatic as the deficiencies of single-sample calling at low depths. Ultimately our mission in this space (on behalf of the Broad's production arm) is to bring down the amount of sequencing needed per sample to empower variant discovery (thereby reducing cost), so we are admittedly less concerned with what happens at the high end of the depth range. I will amend the documentation to clarify this and the fact that major gains are mainly to be had at lower sequencing depths.

  • blueskypy Member ✭✭

    hi, Geraldine,
    Thanks for the reply! But shouldn't joint calling, at least in theory, always outperform individual calling, because it can have a more accurate estimate of allele frequency? The following paper suggests that "sites with very high read depth can actually have higher error rates because of mapping problems"; could that be the reason that joint calling fell behind at sites of high depths? I mean, the local realignment in HC may have trouble processing too many sequences.
    http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.2835.html

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    GATK handles the problem of sites with high read depth by downsampling to a max coverage per sample, so I wouldn't expect that to make much difference. Typically, all samples display the same problems at such sites (because the issues are inherent to the local "geography" of the sites) and coverage is usually maxed-out at the per-sample level even in lower-depth experiments. So I don't think that could realistically account for the differences you observe (you get crap for those sites in any configuration). Now, what I can imagine happening (and note that this is pure speculation on my part) is that you're actually seeing random effects due to downsampling. The HC downsamples pretty aggressively, and for any given window there's a chance that it's only seeing a lopsided picture of the variation present in the data (which is more likely at high depths than at low depths). It would be interesting to see what happens if you do multiple replications of your tests with a randomized element, to evaluate results based on HC seeing different sets of reads every time.
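
    As far as I know there is no user-facing seed for HC's internal downsampler, so one way to get that randomized element is to pre-downsample the BAMs with different seeds, e.g. with samtools (the seeds and kept fraction below are just illustrative):

        # samtools -s takes SEED.FRACTION; three replicates keeping ~80% of
        # reads, each with a different random seed, then HC on each replicate
        for seed in 1 2 3; do
            samtools view -b -s ${seed}.8 s5.bam > s5.seed${seed}.bam
            samtools index s5.seed${seed}.bam
            java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
                -R human_g1k_v37.fasta -I s5.seed${seed}.bam -o s5.seed${seed}.vcf
        done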

  • blueskypy Member ✭✭

    hi, Geraldine,
    I wonder if you can give me some suggestions on joint calling on whole-exome seq data with a mean seq depth of 70X?

    I have 100 patient samples; 90% are Caucasians (the remaining 10% are from a number of different races), and about 45% are female. From those 100 samples, we will take a large number of subsets for comparison; for example, one subset containing 60 samples could be high blood pressure vs. low blood pressure.

    Now, should I run joint calling just once on all 100 samples, or separately for each subset?

    Thanks a lot!

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Hi there,

    You only need to run joint calling once with all 100 samples. I would recommend also doing VQSR on them all together (sketch below). Then you can analyze your subsets separately. Have fun!
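
    In outline (the resource files and annotation list are the usual ones from our docs; treat the exact values as illustrative rather than a prescription):

        # build the recalibration model on the joint callset (SNP mode shown)
        java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
            -R human_g1k_v37.fasta -input cohort100.vcf \
            -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.vcf \
            -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.vcf \
            -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_137.vcf \
            -an QD -an MQRankSum -an ReadPosRankSum -an FS \
            -mode SNP \
            -recalFile cohort100.snps.recal -tranchesFile cohort100.snps.tranches

        # apply the model, filtering to the chosen sensitivity tranche
        java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
            -R human_g1k_v37.fasta -input cohort100.vcf \
            -recalFile cohort100.snps.recal -tranchesFile cohort100.snps.tranches \
            --ts_filter_level 99.5 -mode SNP -o cohort100.snps.recalibrated.vcf

        # then repeat both steps with -mode INDEL for the indels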

  • blueskypy Member ✭✭

    I just got another 300 samples, so would it be best to joint call all 400 samples at once? I can run HC on each chromosome separately, but even so, is there a way to estimate the time it would take to run chr1? I know it depends on the hardware, but how long would it take on the cluster at the Broad?

    Thanks so much, Geraldine!

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    A long time... My advice: wait another week for the GATK 3.0 release, and use the new workflow, which will allow you to call your samples incrementally with much lower computational cost, then do joint analysis on the full set. It is seriously a game changer.

  • blueskypy Member ✭✭

    That sounds really great! Is it certain that 3.0 will be released in a week or so?

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Certain, no, but it is what we are aiming for. Admittedly sometimes it takes an extra week if unexpected issues pop up at the last minute. But we are close.

  • blueskypy Member ✭✭

    So the major revision in 3.0 will be HC? Will there be much change to BQSR? Since I have 400 samples now, I wonder if I can create the BAM files using version 2.8 now and use v3.0 for the remaining steps once it's released?

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Hi @blueskypy,

    Generally speaking we don't recommend mixing & matching versions, especially across major version releases (e.g. 2.x to 3.x). We are actually gearing up to trigger the release tomorrow, so I would recommend you just wait for 3.0. It is very close now!

  • blueskypy Member ✭✭

    hi, Geraldine,
    I see that v3.0 is released. But I cannot find any docs regarding the major changes in HC or how to take advantage of the new HC.

  • pdexheimer Member ✭✭✭✭

    There are docs posted under "Announcements" and under Methods & Workflows in the Guide. Look for the phrase "Reference model pipeline for incremental joint variant discovery".
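
    In outline, the new workflow is: run HC once per sample in reference-confidence (GVCF) mode, then joint-genotype across all the per-sample GVCFs. Roughly (see that doc for the exact arguments; file names are placeholders):

        # step 1, per sample: can be scattered by interval (e.g. -L chr1)
        # and run incrementally as new samples arrive
        java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
            -R human_g1k_v37.fasta -I s1.bam \
            --emitRefConfidence GVCF \
            --variant_index_type LINEAR --variant_index_parameter 128000 \
            -o s1.g.vcf

        # step 2, once: joint genotyping across all per-sample GVCFs
        java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
            -R human_g1k_v37.fasta \
            --variant s1.g.vcf --variant s2.g.vcf \
            -o joint.vcf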

  • blueskypy Member ✭✭
    edited April 2014

    hi, Geraldine,

    This is a re-draw of the figures I posted on Feb. 20th. s5 is a female Caucasian. The red line is an individual calling of sample s5; f39 is s5 in a joint calling of 39 female Caucasians; c404 is s5 in a joint calling of 404 samples, mostly Caucasians and roughly half male, half female. All samples were WES at around 70X. The sequencing depth of the sites in the fig ranges from 2 to 120. The gold reference is GIB, from the same patient as s5. Here are my observations:

    1. GATK v3.1 has higher calling sensitivity and lower discrepancy than v2.8.
    2. Joint calling has better performance for sites below 18X. However, I wonder why c404, with 10 times more samples, does not perform better than f39?

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    @blueskypy Can you remind me of the experimental setup? What exactly are the data points plotted here?

  • blueskypy Member ✭✭

    Basically it's to compare the calling accuracy on sample s5 between joint calling (putting s5 in the two cohorts, f39 and c404) and individual calling (calling s5 alone), at sites of different sequencing depths in s5.

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    Sure, I get the idea, but I'd like to know the underlying data analysis. Meaning, when you have a curve showing sensitivity per depth, are you classifying individual sites by the depth observed in the samples and, based on how many variants are called successfully at those sites (out of X from your gold standard), plotting the sensitivity value for that depth? If so, how do you handle cases where the same site is covered at a different depth in different samples (taking an average?)? Or are you downsampling the data to each target depth, calling variants at that depth, and plotting the overall sensitivity you get for each one? Or doing something else entirely?

  • blueskypy Member ✭✭
    edited April 2014

    hi, Geraldine, the x axis is the sites ONLY in s5. BTW, you probably have my email. Feel free to send me an email if you'd like to chat over the phone.

  • Geraldine_VdAuwera Cambridge, MA · Member, Administrator, Broadie admin

    @blueskypy I am now officially deeply confused about these plots (no offense -- I'm probably missing something obvious), but I really don't have time right now to get into this, sorry. I suggest you reach out to Craig and discuss it with him; then he and I will discuss whether he needs any input from me.

  • blueskypy Member ✭✭

    Hi, Geraldine,
    No problem! I just want to help to improve GATK. I'll talk to Craig.
