Multi-sample calling on distant samples


I'm working with a non-model organism, specifically with samples from the citrus genus. Basically, we search for candidate SNVs that could be responsible for most of the phenotypic differences between citrus varieties/species.

Because a reference assembly is not available for every species, we have implemented our genotyping pipeline by mapping all citrus species against the same reference genome (the clementine genome). We are aware that this approach can introduce a bias that grows with the distance between each sample's species and the reference, but we also know that the species in the citrus genus are relatively close, and it's quite easy to find many conserved regions between them.

My question is about multi-sample calling. We are confident about performing multi-sample calling when we compare samples within a species, but we are not so sure about following the same methodology when we compare distant samples that don't share a considerable proportion of their variants.

What would you recommend?

We see two alternatives:

1. Perform multi-sample calling, on the assumption that the variants will still be detected despite the genomic heterogeneity.
2. Perform independent calling for each sample and combine the results afterwards (using the CombineVariants tool).

Thanks in advance
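
To make alternative 2 concrete, here is a minimal Python sketch (not GATK code) of what merging independently called samples amounts to. The sample names, scaffold name, positions and genotypes are all made up for illustration. The sketch assumes that a merged call set can only report what each single-sample call set contains, so a site called in one sample and absent from another ends up as a missing genotype ("./.") rather than a confirmed homozygous-reference call.

```python
def merge_call_sets(call_sets):
    """Merge per-sample calls into one site-by-sample table.

    call_sets maps a sample name to {(chrom, pos): genotype string}.
    """
    samples = sorted(call_sets)
    sites = sorted({site for calls in call_sets.values() for site in calls})
    table = {}
    for site in sites:
        # A sample with no record at this site is unknown ("./."), not "0/0":
        # independent calling never evaluated it as a candidate site there.
        table[site] = {s: call_sets[s].get(site, "./.") for s in samples}
    return table


# Hypothetical calls from two independently genotyped samples.
calls = {
    "sample_A": {("scaffold_1", 1200): "0/1"},
    "sample_B": {("scaffold_1", 1200): "1/1", ("scaffold_1", 5400): "0/1"},
}

merged = merge_call_sets(calls)
for site, genotypes in merged.items():
    print(site, genotypes)
```

The missing entries are the practical cost of alternative 2: for a private variant, the other samples cannot be told apart as "reference" versus "not enough data" without going back to the reads.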


Best Answer


  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    Hi Jose,

    We don't really have experience with looking at variants across different species, since we're focused on humans, so I can't really tell you one way or another. But the key point is to understand what the benefit of multi-sample calling is. When a sample has a low-confidence call that the caller would reject if it looked at that sample alone, and the caller also sees the same variant in other samples of the population, it will "rescue" the low-confidence call. So for your case it depends on how likely that scenario is to happen (and to be meaningful). Does that help?
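
The "rescue" effect described above can be illustrated with a toy calculation. This is a sketch under simplifying assumptions and not the model GATK actually uses: each sample contributes three hypothetical genotype likelihoods (hom-ref, het, hom-alt), genotypes follow Hardy-Weinberg proportions for an unknown alt allele frequency, and that frequency gets a uniform prior on a grid. The function name, the prior of 0.001 and the likelihood values are all invented for the example.

```python
def site_posterior(samples, prior_variant=0.001, grid=200):
    """Toy posterior probability that the alt allele frequency is above zero.

    samples is a list of (l_ref, l_het, l_hom) tuples, i.e. P(reads | genotype).
    """
    def likelihood(f):
        # Probability of all reads given alt allele frequency f,
        # assuming Hardy-Weinberg genotype proportions in every sample.
        total = 1.0
        for l_ref, l_het, l_hom in samples:
            total *= (1 - f) ** 2 * l_ref + 2 * f * (1 - f) * l_het + f ** 2 * l_hom
        return total

    lik_no_variant = likelihood(0.0)
    # Average the likelihood over a grid of non-zero frequencies (uniform prior).
    lik_variant = sum(likelihood(i / grid) for i in range(1, grid + 1)) / grid
    weighted = prior_variant * lik_variant
    return weighted / (weighted + (1 - prior_variant) * lik_no_variant)


# Hypothetical weak evidence: reads are only 20x more likely under "het" than "hom-ref".
weak_het = (1.0, 20.0, 0.1)

alone = site_posterior([weak_het])          # one sample called by itself
together = site_posterior([weak_het] * 3)   # three samples showing the same weak signal

print(f"single sample: {alone:.3f}")
print(f"three samples: {together:.3f}")
```

In this toy, the same weak signal gives a much higher site-level posterior when it is seen in three samples than when it is seen in one, which is the sense in which shared variants benefit from being called together.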

  • jcarbonell Member

    Thanks a lot for your quick response Geraldine

    I understand that you don't really have much experience with non-human species, but I think that my case is analogous to a multi-sample calling run in which individuals from different human populations are genotyped together, and to the question of how well that procedure detects rare variants.

    In relation to this, I found a question that was previously posted on the forum.

    In that thread, Mark DePristo confirmed "the nasty side-effect of making it harder to call the rare variants in both populations because the reads from AFK count against your EUR data". In the end, Mark suggested a possible workflow that mixes independent and combined calling by squaring off the likelihood matrix, which is not very clear to me.

    I believe that my case can be described as multi-sample calling on "distant samples", where some of the samples simply won't share their variants with the rest.

    If I understand correctly, the likelihood of a given alternative allele gets worse as the number of reference-like samples increases.

    An extreme case of this situation would be a multi-sample calling run where a single AFR individual is genotyped together with, for example, 1000 EUR individuals. The obvious question is: will the private AFR variants be correctly detected, or will the 1000 reference-like EUR genotypes end up making the AFR variant look like noise?

    Thanks again (and sorry for this looong comment)
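
The extreme case raised above (one carrier genotyped with many reference-like samples) can be put into rough numbers with a toy model. The sketch below is not GATK's algorithm. It compares only two hypotheses, "no alt allele anywhere" and "exactly one alt chromosome in the cohort", and assumes that the single alt chromosome is equally likely to sit in any sample, so the per-sample evidence is averaged over the cohort. The function name, the prior `theta` and all likelihood ratios are hypothetical.

```python
import math


def singleton_odds(likelihood_ratios, theta=0.001):
    """Toy posterior odds of 'one alt chromosome' versus 'no alt allele'.

    likelihood_ratios[i] = P(reads_i | het) / P(reads_i | hom-ref) for sample i.
    """
    n = len(likelihood_ratios)
    # The alt chromosome could be in any of the n samples with equal probability,
    # so the carrier's evidence is divided by the cohort size.
    averaged = sum(likelihood_ratios) / n
    return theta / (1.0 - theta) * averaged


strong_carrier = 1e8    # hypothetical: deep coverage, clear heterozygous signal
weak_carrier = 1e4      # hypothetical: shallow coverage, marginal signal
reference_like = 1e-6   # hypothetical: reads strongly favour hom-ref

strong_alone = singleton_odds([strong_carrier])
strong_diluted = singleton_odds([strong_carrier] + [reference_like] * 1000)
weak_alone = singleton_odds([weak_carrier])
weak_diluted = singleton_odds([weak_carrier] + [reference_like] * 1000)

for label, odds in [
    ("strong carrier alone", strong_alone),
    ("strong carrier + 1000 reference-like", strong_diluted),
    ("weak carrier alone", weak_alone),
    ("weak carrier + 1000 reference-like", weak_diluted),
]:
    print(f"{label}: log10 odds = {math.log10(odds):.1f}")
```

Under these assumptions, adding 1000 reference-like samples costs the private variant roughly three orders of magnitude in odds. A well-supported variant stays above even odds, while a marginal one drops below it, so in this toy the answer depends mainly on how strong the carrier's own evidence is.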

  • jcarbonell Member

    Thanks a lot Geraldine.

    Your last comment has been very helpful, and it is now quite clear to me.

    Best regards
