Calling SNPs using data from multiple platforms or different systems of the same platform?

mike (Posts: 103; Member)

Hi,

For SNP calling, the documentation suggests pooling samples together when running UnifiedGenotyper. My questions are:

  1. I have samples from the same study, all exome-seq on Illumina platforms, but for historical reasons some were run on GAII and some on HiSeq 2000. Is it OK to pool them together to call SNPs with UnifiedGenotyper? How about the new HaplotypeCaller? Any concerns there?

  2. What about data from the same platform but prepared with different exome-capture kits? My take is that this is probably just a matter of which regions to look at for variants.

  3. What about data from different platforms, e.g., some from Illumina and some from Ion Torrent? Any concerns beyond the need for a common interval file covering the shared regions? Has anybody tried this before? Or should I just call SNPs separately for each platform?

Thanks a lot for your help! Happy Thanksgiving!

Best

Mike

Answers

  • Geraldine_VdAuwera (Posts: 6,643; Administrator, GATK Developer)

    Hi Mike,

    We don't have much experience mixing different datatypes so you'll need to experiment on your own, or ask the community. Just keep in mind that different technologies (whether vendor platform or platform version) will produce different error modes, so things like base recalibration that are sensitive to tech specificities should be done separately. But calling and downstream steps should be okay to do on mixed types.

    On the point of capture targets, keep in mind that if you look at the union of target sets, there will be drastic differences in coverage for targets that are not in the intersection, so any analysis that is coverage-dependent should account for this. It would be safer to restrict analysis to the intersection of the sets.

    Let us know how it goes and if there are any particular obstacles you run into. Good luck!

    Geraldine Van der Auwera, PhD
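
The suggestion above to restrict analysis to the intersection of the capture target sets can be sketched in a few lines of Python. This is only an illustration, not part of GATK: the function name `intersect_targets`, the tuple layout, and the example coordinates are all made up, and it assumes BED-style half-open intervals that do not overlap within a single kit's list.

```python
def intersect_targets(a, b):
    """Return the regions shared by two capture target lists.

    Each list holds (chrom, start, end) tuples with half-open coordinates
    (BED-style). Intervals within one list are assumed not to overlap.
    """
    a, b = sorted(a), sorted(b)
    shared = []
    i = j = 0
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a != chrom_b:
            # Advance whichever list is behind in chromosome sort order.
            if chrom_a < chrom_b:
                i += 1
            else:
                j += 1
            continue
        start, end = max(start_a, start_b), min(end_a, end_b)
        if start < end:
            shared.append((chrom_a, start, end))
        # Drop the interval that ends first; the other may overlap the next one.
        if end_a <= end_b:
            i += 1
        else:
            j += 1
    return shared


# Hypothetical targets for two capture kits (coordinates are invented).
kit_a = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 50, 120)]
kit_b = [("chr1", 150, 250), ("chr1", 600, 700), ("chr3", 10, 90)]

for region in intersect_targets(kit_a, kit_b):
    print(*region, sep="\t")
```

Targets present in only one kit (chr2 and chr3 here) drop out, which is the point: coverage-dependent analyses then only see regions that every sample was enriched for. In practice an existing interval tool would likely do this job; the sketch just makes the logic explicit.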

  • pdexheimer (Posts: 372; Member, GSA Collaborator ✭✭✭)

    My take on this: Any time you're learning covariates across the whole cohort, be very careful about pooling across multiple platforms. So for your data, you absolutely should not mix Illumina and Ion Torrent for base recalibration/indel realignment (completely different error models), and I would be extremely cautious about pooling for variant calling. My intuition is that GAII and HiSeq have close enough to the same error models that you'd be okay for calling, but I would at least separate the cleaning steps by platform.

    The other major learning step is VQSR. I go out of my way to avoid mixing enrichment platforms at this step because the variant metrics will be different for the platforms (for instance, in the read position and strand composition in the "splash" regions around the probes).

    So to me, the safest approach is to run each platform combination separately - with the possible exception of the GA/HiSeq split - and only merge once I have the filtered VCFs. But I may be overly cautious...
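
To make "run each platform combination separately" concrete, here is a minimal sketch of how samples could be split into batches before any platform-sensitive step. Everything in it is hypothetical: the manifest layout, the sample and kit names, and the `SEQUENCER_FAMILY` mapping, which simply encodes the GA/HiSeq exception described above as an assumption rather than a GATK rule.

```python
from collections import defaultdict

# Assumption: GAII and HiSeq 2000 are treated as one error-model family,
# following the "possible exception" mentioned above.
SEQUENCER_FAMILY = {
    "GAII": "Illumina",
    "HiSeq 2000": "Illumina",
    "Ion Torrent": "Ion Torrent",
}


def batch_by_platform(samples):
    """Group sample names by (sequencer family, capture kit)."""
    batches = defaultdict(list)
    for name, sequencer, capture_kit in samples:
        key = (SEQUENCER_FAMILY[sequencer], capture_kit)
        batches[key].append(name)
    return dict(batches)


# Hypothetical manifest: (sample, sequencer, capture kit).
manifest = [
    ("S1", "GAII", "kitA"),
    ("S2", "HiSeq 2000", "kitA"),
    ("S3", "HiSeq 2000", "kitB"),
    ("S4", "Ion Torrent", "kitB"),
]

for key, names in sorted(batch_by_platform(manifest).items()):
    print(key, names)
```

Each resulting batch would then go through calling and filtering on its own, with the filtered VCFs merged only at the end. Dropping the family mapping and keying on the raw sequencer name gives the stricter split where GAII and HiSeq are also kept apart.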

  • mike (Posts: 103; Member)

    Thanks for the great comments and info from both of you, which are very helpful!

    However, although I only asked about the UG/SNP calling step, both of you mentioned that platforms should not be mixed for base recalibration/indel realignment because of their completely different error models. I actually ran these steps individually on each sample/BAM file. So I'm just curious: why would different platforms affect these steps, in addition to later steps such as SNP calling?

    Also, what about VQSR? Should I call variants on each platform's data separately and then combine the variants before running VQSR, or run VQSR on each callset without combining? Which would be best? It seems to me that calling variants separately, running each callset through VQSR, and only then combining the final callsets would be better.

    Thanks again for your great input!

    Mike

  • mike (Posts: 103; Member)

    Thanks a lot for the info, Geraldine! I appreciate your input very much!

    Happy Thanksgiving!

    Mike

  • pdexheimer (Posts: 372; Member, GSA Collaborator ✭✭✭)

    I hadn't considered the possibility of combined realignment, Geraldine, that's a good point. I always lump recal/realign/dedup together in my head, guess I need to keep in mind that they are very separate processes.

    With the caveat that I haven't actually tried this (we don't mess around much with alternate platforms), I would say that combining for the realignment step makes a lot of sense if you've sequenced the same sample - or close family members - on different platforms. But my suspicion is that it would be much less important if you're combining a cohort rather than individuals, since individual small indels are generally pretty rare. But I can't see it hurting, and I do like consistency in my data...

  • Geraldine_VdAuwera (Posts: 6,643; Administrator, GATK Developer)

    Thanks for your comments, @pdexheimer. We like consistency too :) BTW, you've been doing a great job jumping in and helping people on the forum. We might have to promote you to honorary GSA member (and I'm only partly kidding).

    Geraldine Van der Auwera, PhD

  • mike (Posts: 103; Member)

    Hi,

    I just ran into a situation similar to, but slightly different from, the one discussed above, so I'm coming back to this thread for more advice.

    Because our project has run over a long period, the samples arrive in batches. We called variants with GATK on the first batch a while ago. Later we collected more exome-seq samples, combined them with the first batch, and called variants with GATK again. Now even more new samples are coming in, and I'm not sure how to handle new samples from the same project when the old samples have already been used to call variants. Since GATK recommends calling all samples together for better performance (right?), does that mean that each time we get new samples we need to pool everything and call variants again, ignoring what was done before? Is there a benefit to both the old and the new batches from pooling them all together, rather than keeping the calls already made from the old samples alone?

    Given that GATK wants samples called together for better performance, this must be a routine practical issue: new samples arrive later, and every time I need to re-call variants together with the old samples. What is the better practice?

    How do you guys deal with this situation? For example, in the 1kG project you may have had 300 samples last year and then another 400 samples this year. Do you pool them all together and re-call variants, or can you call them as separate groups if each group has a decent number of samples? If the latter, what group size counts as "decent", i.e. good enough to call variants for each group individually rather than pooling everything, which could also be a big burden on the system?

    Look forward to your advice!

    Best

    Mike

  • Geraldine_VdAuwera (Posts: 6,643; Administrator, GATK Developer)

    Hi Mike,

    The best practice is to re-call all your samples together. For 1kG we periodically re-call everything. If you have very many samples you will need to develop a robust solution for doing this.

    Geraldine Van der Auwera, PhD

  • mike (Posts: 103; Member)

    Hi Geraldine, thanks a lot for the advice and info. Mike
