Too many variants called using GATK
Hi, I am working on SOLiD 5500xl data and used SHRiMP2.2.3 for performing mapping. The library type is paired-end. I have read some discussions regarding SOLiD problem but I still have some doubts regarding some steps in best practices
- Local-Realignment at In-dels: Since local realignments take place in base space instead of color-space I doubt the accuracy of the alignment
- Mark/Remove Duplicates: Reads just lying in the same position (end to end) may not necessarily be duplicates. Some of these reads may have putative variants, which otherwise may be filtered out.
- Base quality score recalibration: I am not sure whether this is applicable for 5500xl as well, since quality values have slightly changed on 5500 from previous SOLiD versions as far as I know.
So after mapping, I simply used GATK UnifiedGenotyper to call SNPs and InDels under default values. I end up getting around 40 million variants. Is there any way I can get a more refined variant calling? Do you still consider me applying the above pre-processing steps or do you recommend me applying some variant filteratiion on the called variants? If yes for the previous, then could you explain how my above concerns are taken care of? I was trying to look at some general recommended filter values on INFO fields in VCF format such as BQ, MQ, MQ0, DP, SB etc. Do you recommend some generally used values of these fields on which I can filter and hence refine my variant data?
I may have posted a subset of the above question, which I am not sure was posted successfully since at that time I just created an account. If you have already answered this question then I apologize for that. Could you then provide me the link where you answered it?
Thanks in advance