Hi,
For SNP calling, the documentation suggests pooling samples together when running UnifiedGenotyper. My questions are:
I have exome-seq samples from the same study, all sequenced on Illumina platforms, but for historical reasons some were run on a GAII and some on a HiSeq 2000. Is it OK to pool them together to call SNPs with UnifiedGenotyper? How about the new HaplotypeCaller? Are there any concerns?
What about data from the same platform but captured with different exome-capture kits? My take is that this is probably just a matter of which regions to look at for variants.
What about data from different platforms, e.g. some from Illumina and some from Ion Torrent? Are there any concerns beyond needing a common interval file for the shared regions? Has anybody tried this before? Or should I just call SNPs separately for the data from each platform?
Thanks a lot for your help! Happy Thanksgiving!
Best
Mike
Geraldine_VdAuwera
Ah, small misunderstanding because we often use the phrase "calling SNPs" to refer to the entire workflow, not just the calling step.
Re: base recalibration -- the whole point is to correct for sequencing machine errors. Different technologies produce different types of errors. It can be due to apparatus or chemistry, but basically it means that systematic errors will be different: one tech will make a lot of mistakes in GC-rich contexts, while another will mostly make errors at the end of reads. If you use the same correction model for both datasets, the correction will be wrong half the time. And that would be bad, because in the next step the caller would be assessing variants based on badly flawed information.
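To make the "wrong half the time" point concrete, here is a toy arithmetic sketch. It is not how BaseRecalibrator actually works (the real tool models several covariates); the error counts are invented purely for illustration, and `phred` is a hypothetical helper. It only shows that a single empirical quality learned from pooled data describes neither platform well.

```python
import math

def phred(errors, bases):
    """Empirical Phred-scaled quality: Q = -10 * log10(error rate)."""
    return -10 * math.log10(errors / bases)

# Hypothetical mismatch counts in the same sequence context (made-up numbers).
platform_a = (100, 100_000)    # 1 error per 1,000 bases, about Q30
platform_b = (1_000, 100_000)  # 1 error per 100 bases, about Q20

q_a = phred(*platform_a)
q_b = phred(*platform_b)

# A single model learned from the pooled counts lands between the two.
q_pooled = phred(platform_a[0] + platform_b[0],
                 platform_a[1] + platform_b[1])

print(f"A: Q{q_a:.1f}  B: Q{q_b:.1f}  pooled: Q{q_pooled:.1f}")
```

Under these assumed counts the pooled estimate sits between Q20 and Q30, so it understates the quality of one platform's bases and overstates the other's.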
One could actually make a case for doing indel realignment jointly, since you want your indels to be positioned the same way for all your data... @pdexheimer, do you have any comments on that? As I said, we don't really deal with mixed datasets, so I'd be curious to hear if you have any additional insights from working with this type of issue.
VQSR should be a little less sensitive, but there can still be some platform effects. If you can do everything separately and combine the VCFs at the end, that's probably best, but some people want to combine data earlier, for example if they don't have a lot of data in a single dataset.
Geraldine Van der Auwera, PhD
Answers
Hi Mike,
We don't have much experience mixing different datatypes so you'll need to experiment on your own, or ask the community. Just keep in mind that different technologies (whether vendor platform or platform version) will produce different error modes, so things like base recalibration that are sensitive to tech specificities should be done separately. But calling and downstream steps should be okay to do on mixed types.
On the point of capture targets, keep in mind that if you look at the union of target sets, there will be drastic differences in coverage for targets that are not in the intersection, so any analysis that is coverage-dependent should account for this. It would be safer to restrict analysis to the intersection of the sets.
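As a rough sketch of what restricting to the intersection means, the snippet below intersects two capture target lists. It assumes BED-style half-open `(chrom, start, end)` tuples; the function name `intersect_targets` and the example coordinates are hypothetical, and in practice an established interval tool would normally do this job.

```python
from collections import defaultdict

def intersect_targets(kit_a, kit_b):
    """Return the regions covered by both target sets.

    Inputs are lists of (chrom, start, end) with half-open coordinates.
    """
    def merged_by_chrom(intervals):
        # Group by chromosome, then merge overlapping or touching spans.
        by_chrom = defaultdict(list)
        for chrom, start, end in intervals:
            by_chrom[chrom].append((start, end))
        merged = {}
        for chrom, spans in by_chrom.items():
            spans.sort()
            out = []
            for start, end in spans:
                if out and start <= out[-1][1]:
                    out[-1] = (out[-1][0], max(out[-1][1], end))
                else:
                    out.append((start, end))
            merged[chrom] = out
        return merged

    a, b = merged_by_chrom(kit_a), merged_by_chrom(kit_b)
    shared = []
    for chrom in sorted(set(a) & set(b)):
        xs, ys = a[chrom], b[chrom]
        i = j = 0
        # Two-pointer sweep over the sorted, merged spans.
        while i < len(xs) and j < len(ys):
            start = max(xs[i][0], ys[j][0])
            end = min(xs[i][1], ys[j][1])
            if start < end:
                shared.append((chrom, start, end))
            # Advance whichever span finishes first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
    return shared

# Hypothetical targets from two capture kits.
kit_a = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]
kit_b = [("chr1", 150, 350), ("chr3", 10, 20)]
print(intersect_targets(kit_a, kit_b))
```

Targets present in only one kit (here the `chr2` and `chr3` entries) drop out, which is the point: those are the regions where coverage would differ drastically between sample groups.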
Let us know how it goes and if there are any particular obstacles you run into. Good luck!
Geraldine Van der Auwera, PhD
My take on this: Any time you're learning covariates across the whole cohort, be very careful about pooling across multiple platforms. So for your data, you absolutely should not mix Illumina and Ion Torrent for base recalibration/indel realignment (completely different error models), and I would be extremely cautious about pooling for variant calling. My intuition is that GAII/HiSeq have close enough to the same models that you'd be okay for calling, but I would at least separate the cleaning steps by platform.
The other major learning step is in VQSR. I go out of my way to avoid mixing enrichment platforms at this step because the variant metrics will be different for the platforms (for instance, in the read position and strand composition in the "splash" regions around the probes)
So to me, the safest approach is to run each platform combination separately - with the possible exception of the GA/HiSeq split - and only merge once I have the filtered VCFs. But I may be overly cautious...
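One way to picture the "each platform combination separately" bookkeeping is to bucket samples by sequencing platform and capture kit before any processing. The sketch below is only an illustration: the sample records, kit names, and the `group_by_batch` helper are hypothetical, and the optional `merge_platforms` mapping stands in for the judgment call of treating GAII and HiSeq as one group.

```python
from collections import defaultdict

def group_by_batch(samples, merge_platforms=None):
    """Bucket sample names by (platform, capture kit).

    merge_platforms optionally maps a platform name to a shared label,
    for platforms judged similar enough to process together.
    """
    merge_platforms = merge_platforms or {}
    batches = defaultdict(list)
    for sample in samples:
        platform = merge_platforms.get(sample["platform"], sample["platform"])
        batches[(platform, sample["kit"])].append(sample["name"])
    return dict(batches)

# Hypothetical cohort metadata.
samples = [
    {"name": "S1", "platform": "GAII", "kit": "kitA"},
    {"name": "S2", "platform": "HiSeq2000", "kit": "kitA"},
    {"name": "S3", "platform": "HiSeq2000", "kit": "kitB"},
    {"name": "S4", "platform": "IonTorrent", "kit": "kitA"},
]

strict = group_by_batch(samples)
relaxed = group_by_batch(
    samples, merge_platforms={"GAII": "Illumina", "HiSeq2000": "Illumina"}
)
print(len(strict), "batches when every platform is kept apart")
print(len(relaxed), "batches when GAII and HiSeq are treated as one")
```

Each resulting batch would then go through cleaning, calling, and filtering on its own, with the filtered VCFs merged only at the end.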
Thanks for the great comments and info from both of you, which are very helpful!
However, although I only asked about the UnifiedGenotyper (SNP calling) step, both of you mentioned that platforms should not be mixed for base recalibration/indel realignment because of their completely different error models. I actually ran these steps individually on each sample/BAM file. So I'm curious: why would different platforms affect these steps, in addition to later steps such as SNP calling?
Also, what about VQSR? Should I call variants from each platform separately and then combine them before running VQSR, or run VQSR on each callset without combining? It seems to me that calling variants separately and running VQSR on each callset before combining the final callsets would be better.
Thanks again for your great input!
Mike
Thanks a lot for the info, Geraldine! I appreciate your input very much!
Happy Thanksgiving!
Mike
I hadn't considered the possibility of combined realignment, Geraldine, that's a good point. I always lump recal/realign/dedup together in my head, guess I need to keep in mind that they are very separate processes.
With the caveat that I haven't actually tried this (we don't mess around much with alternate platforms), I would say that combining for the realignment step makes a lot of sense if you've sequenced the same sample - or close family members - on different platforms. But my suspicion is that it would be much less important if you're combining a cohort rather than individuals, since individual small indels are generally pretty rare. But I can't see it hurting, and I do like consistency in my data...
Thanks for your comments, @pdexheimer. We like consistency too :) BTW, you've been doing a great job jumping in and helping people on the forum. We might have to promote you to honorary GSA member (and I'm only partly kidding).
Geraldine Van der Auwera, PhD
Thanks again to both of you! Mike
Hi,
I've just run into a situation similar to, but slightly different from, the one discussed above, so I'm coming back to this thread for more advice.
Our project has run over a long period. We called variants with GATK on the first batch of samples a while ago. Later we collected more exome-seq samples, combined the samples from both batches, and called variants with GATK again. Now even more new samples are coming in, and I'm not sure how to handle new samples from the same project alongside old samples that have already been used to call variants. Since GATK recommends calling all samples together for better performance (right?), each time we get new samples we would need to pool everything and call variants again, ignoring what was done before. Is there a benefit to both the old and the new batches from pooling them all together, rather than keeping the calls previously made from the old samples alone?
Because GATK performs better when samples are called together, this seems likely to be a routine practical issue: new samples arrive later, and every time I need to re-call variants together with the old samples. What is the better practice?
How do you deal with this situation? For example, in the 1kG project, you may have 300 samples last year and another 400 samples coming in this year. Do you need to pool them all and re-call variants, or can you call them as separate groups if each group has a decent number of samples? If so, what group size can be considered "decent", i.e. good enough to call variants for each group individually rather than pooling everything, which could also be a big burden on the system?
Look forward to your advice!
Best
Mike
Hi Mike,
The best practice is to re-call all your samples together. For 1kG we periodically re-call everything. If you have very many samples you will need to develop a robust solution for doing this.
Geraldine Van der Auwera, PhD
Hi Geraldine, thanks a lot for the advice and info. Mike