We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Single sample or joint calling in experiment with varying sample sizes

I have some genome sequences from 3 closely related non-model species (same genus), and i want to do snp calling and then look at site frequency spectrum, genetic diversity etc .... However, the sample sizes are quite variable (30 vs 10 vs 8). Moreover, the species with 30 is also the species for which the genome is available. Consequently i'm a bit worried about doing joint calling, because singletons are typically undercalled, and i imagine variants will be better detected in the species with 30 samples. Would you agree single sample calling in haplotype caller was probably the best approach here? Someone suggested to me joint calling across all the species, but that seems like a bad idea to me, and i have no idea what biases that would introduce.

Issue · Github
by Sheila

Issue Number
Last Updated
Closed By


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, I understand your concern. Yet single-sample calling is almost never the recommended option, because it leads to missing information that will bite you when you try to compare results across samples. The real questions are

    • Should you use the joint genotyping workflow or attempt traditional multisample calling? (where you pass all samples together to HC, the downside being that performance goes down the toilet as number of samples increases, especially beyond a few tens of samples)


    • Should you call all samples together or by subset according to species?

    For both it depends what you're trying to get in terms of information; for the second it also depends a lot on whether you intend to compare across species or not.

    The main use case where I would favor traditional multisample calling over the GVCF-based joint calling is when you have low coverage data (below 15x) because that already sets you up to have lower sensitivity in the per-sample portion of the pipeline (which wouldn't be any better in pure single-sample calling anyway, at least based on what we've seen).

    If you want to be able to compare results across species, I would definitely joint-call over all three groups of samples. If you're concerned about the calls getting dropped in the smaller, more reference-divergent group, you can try lowering the emit and call confidence thresholds in HC (if running traditional multisample), or in GenotypeGVCFs (if running the GVCF joint calling workflow). It's true that using one species' reference to call variants in other species will cause bias, but that's pretty much inevitable at this point.

    If you don't want to compare results across species, then calling the groups separately is fine.

  • prepagamprepagam Member

    Hi Geraldine
    So, after looking at the data more closely, we have decided to downsample to 6-7 samples per species for just two of the species so to retain only the samples where the coverage is reasonable (15x or higher on average). One of these is the species with the reference genome.
    We do want to compare results various measures of genetic diversity between species. I'm not sure i understand why joint calling across the two species is preferable over joint calling the two groups seperately or single sample calling? Given our interest in the SFS, and known issues with joint calling under-calling singletons, i'm still unsure what to do.
    Thanks in advance

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Clare,

    We cover this in our workshop presentations. In a nutshell, when you joint call across all the samples that you want to compare, for each site where there's a variant in at least one sample, you'll get detailed statistics for that site in all of the samples. This allows you to be confident that samples not called variant are actually hom-ref, rather than not having sufficient data to make a confident call. We consider this to be very important, but the impact might depend on your research question.

    The only under-calling of singletons that we've seen (where you would call them in single-sample calling, as opposed to singletons that are not callable due to insufficient data quality) occurred in very large cohorts and seemed to be resolved by lowering the confidence thresholds. But I can't discount effects in other datasets of course. Have you experienced that problem in your previous work?

  • prepagamprepagam Member

    We have seen under-calling in Unified Genotyper. I'm doing a comparison soon with Haplotype caller, and will let you know how that goes.

  • sprochasprocha Member

    related to Clare's first question... does anyone here knows any refs (or has any idea about) the performance of HC (and joint genotyping workflow) in multispecies data? We're running some simmulations with 100-1000 genes data of closely related species (2-8 species / 2-8 individuals per species) and we seem to be getting very low recall (=sensitivity) values. We're only calling between 10-20% of the true variants. we're using gatk 3.7 and using best practices for joint genotyping (except B/VQSR... as sims are intended for non-model organisms cases, where you won't know the true, thus won't be able to do it). we're using default read filters at the moment and no hard filtering... and are a bit stricked with the results. Any input would be very very appreciated. many thanks in advance!

Sign In or Register to comment.