Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Too many variants called using GATK

rahil Posts: 2 Member
edited January 2013 in Ask the team

Hi, I am working on SOLiD 5500xl data and used SHRiMP 2.2.3 for mapping. The library type is paired-end. I have read some discussions regarding SOLiD problems, but I still have some doubts about a few steps in the Best Practices:

  1. Local realignment around indels: since local realignment takes place in base space instead of color space, I doubt the accuracy of the realigned alignments.
  2. Mark/remove duplicates: reads that merely lie at the same position (end to end) are not necessarily duplicates. Some of these reads may carry putative variants, which might otherwise be filtered out.
  3. Base quality score recalibration: I am not sure whether this is applicable to the 5500xl as well, since as far as I know the quality values changed slightly on the 5500 compared to previous SOLiD versions.

So after mapping, I simply used the GATK UnifiedGenotyper to call SNPs and indels with default settings, and I ended up with around 40 million variants. Is there any way I can get a more refined variant callset? Would you still recommend applying the pre-processing steps above, or do you recommend applying some variant filtration to the called variants? If the former, could you explain how my concerns above are addressed? I have been looking at generally recommended filter values on VCF INFO fields such as BQ, MQ, MQ0, DP, and SB. Do you recommend commonly used values for these fields on which I can filter, and hence refine, my variant data?
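To make the filtering idea concrete, here is a minimal sketch of hard-filtering VCF records on INFO-field thresholds. The cutoff values below are purely illustrative placeholders, not official GATK recommendations; you would tune them by inspecting your own callset.

```python
# Illustrative hard-filter thresholds -- NOT official recommendations;
# choose cutoffs by inspecting your own data.
MIN_QUAL = 30.0       # variant call confidence (VCF QUAL column)
MIN_MQ = 40.0         # RMS mapping quality (INFO MQ)
MIN_DP = 10           # read depth (INFO DP)
MAX_MQ0_FRAC = 0.1    # max fraction of reads with mapping quality 0

def parse_info(info_field):
    """Turn a VCF INFO string like 'DP=20;MQ=50.0' into a dict."""
    out = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            out[key] = value
        else:
            out[entry] = True  # flag-type entries such as 'DB'
    return out

def passes_filters(vcf_line):
    """Return True if a VCF data line survives the hard filters above."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = float(fields[5])
    info = parse_info(fields[7])
    mq = float(info.get("MQ", 0.0))
    dp = int(info.get("DP", 0))
    mq0 = int(info.get("MQ0", 0))
    if qual < MIN_QUAL or mq < MIN_MQ or dp < MIN_DP:
        return False
    if dp > 0 and mq0 / dp > MAX_MQ0_FRAC:
        return False
    return True

good = "chr1\t100\t.\tA\tG\t50.0\t.\tDP=20;MQ=55.0;MQ0=0"
bad = "chr1\t200\t.\tC\tT\t12.0\t.\tDP=4;MQ=20.0;MQ0=3"
print(passes_filters(good), passes_filters(bad))  # True False
```

In practice you would apply the same logic through GATK's own filtering tools rather than a hand-rolled script, but the principle is the same: each record either passes every threshold or is marked as filtered.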

I may have posted a subset of the above question earlier, but I am not sure it went through, since I had only just created an account at the time. If you have already answered it, I apologize; could you provide the link to your answer?

Thanks in advance


Best Answer

  • BerntPopp Posts: 45
    Answer ✓

    Hey there!

    First, what kind of experiment do you have? Which species (human...?), how many samples, exome or genome? What reference did you use? 40 million variants sure seems like a lot...

    Question 1: IndelRealigner performs great on SOLiD data (LifeScope-, BioScope-, or BWA-mapped) and greatly reduces false positive indel calls, which are mainly caused by the "stupid" mapping software only looking at one read and the reference.

    Question 2: From Picard's MarkDuplicates documentation, as I understand it, a read would only be marked if it had the same sequence and also the same start, end, etc. We also did some evaluation work on this and never encountered a true variant that was lost because of marking duplicates.
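    As a rough sketch of what position-based duplicate marking does (a simplification: the real Picard tool also keys on library, mate position, soft-clipping, and read orientation), reads are grouped by alignment coordinate and only the highest-quality read in each group survives unmarked:

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Simplified position-based duplicate marking.

    reads: list of dicts with 'chrom', 'start' (5' coordinate),
    'strand', and 'baseq_sum' (summed base qualities).
    Returns the set of indices flagged as duplicates; the read with the
    highest summed base quality in each positional group is kept.
    """
    groups = defaultdict(list)
    for i, r in enumerate(reads):
        groups[(r["chrom"], r["start"], r["strand"])].append(i)
    dupes = set()
    for indices in groups.values():
        if len(indices) > 1:
            best = max(indices, key=lambda i: reads[i]["baseq_sum"])
            dupes.update(i for i in indices if i != best)
    return dupes

reads = [
    {"chrom": "chr1", "start": 100, "strand": "+", "baseq_sum": 900},
    {"chrom": "chr1", "start": 100, "strand": "+", "baseq_sum": 950},
    {"chrom": "chr1", "start": 250, "strand": "-", "baseq_sum": 880},
]
print(sorted(mark_duplicates(reads)))  # [0] -- read 1 is kept, read 2 is unique
```

    The key point for your concern: a read at the same position but with a different sequence (e.g. carrying a real variant) still falls in the same positional group, but only when multiple reads stack identically does one get marked, and the best-quality one is kept.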

    Question 3: The problem with the base qualities from the mapping software is that they mostly rely on internal metrics and do not represent the actual probability of mismatching the reference genome. BQSR is developed and tested for many platforms, including SOLiD (gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibrator).
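    The core idea behind recalibration can be shown in a few lines: compare the quality the machine reported with the mismatch rate actually observed against the reference (BQSR excludes known variant sites so real variants are not counted as errors), and replace the reported score with the Phred-scaled empirical one:

```python
import math

def empirical_phred(mismatches, observations):
    """Phred-scaled empirical error rate: Q = -10 * log10(errors / total)."""
    rate = mismatches / observations
    return -10.0 * math.log10(rate)

# Bases the machine reported as Q30 (1 expected error in 1000) that
# actually mismatch the reference 10 times in 1000 observations are
# really only ~Q20 -- recalibration corrects this overconfidence.
q = empirical_phred(10, 1000)
print(round(q))  # 20
```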


  • rahil Posts: 2 Member
    edited November 2012

    Hi BerntPopp, thanks for answering my questions. Many of my doubts are cleared up, but some remain. My sample was human whole exome. I mapped it against the hg19 (UCSC) genome downloaded from the Broad Institute repository. Not only SNPs: I have also been getting around 40 million indels, which again is too liberal based on my experience with human whole-exome projects. Here are my follow-up questions:

    1. Regarding MarkDuplicates: after reading their FAQ, although they claim to handle clipping, I would like to double check with them on how they find duplicate reads from local alignments. I would appreciate it if you could share that information with me, but for now it seems safer to run MarkDuplicates on global-alignment output, since in that case the SAM output carries the complete read information.

    2. Regarding base quality score recalibration: as far as I understood from reading the Best Practices, recalibration operates on the base qualities given by the sequencer. In the case of SOLiD, since the sequencer gives color qualities (the quality of a di-base) instead of base qualities, the mapper SHRiMP2 converts color qualities to base qualities through a forward-backward algorithm to fill in the QUAL field. There is another quality score (MAPQ) that the mapper assigns to the mapping itself. So when you said it "mostly relies on some internal metrics and does not represent the actual probability of mismatching the reference genome", were you talking about MAPQ or just QUAL? Does BQSR recalibrate the MAPQ value as well? I just wanted to make sure.
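    For anyone else confused by the two scores: they live in different SAM columns, and BQSR rewrites only the per-base QUAL string (column 11), never the per-read MAPQ (column 5). A small parsing sketch (the record below is a made-up example):

```python
# A minimal SAM record: 11 mandatory tab-separated fields.
# Column 5 (index 4) is MAPQ, one mapping confidence per read;
# column 11 (index 10) is QUAL, one Phred+33 character per base.
sam_line = "read1\t0\tchr1\t100\t37\t5M\t*\t0\t0\tACGTA\tIIII#"
fields = sam_line.split("\t")

mapq = int(fields[4])                            # per-read score: 37
base_quals = [ord(c) - 33 for c in fields[10]]   # per-base Phred scores
print(mapq, base_quals)  # 37 [40, 40, 40, 40, 2]
```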

    Thank you very much!

  • rahil Posts: 2 Member
    edited November 2012

    Correction to my above comment: sorry, I did not run the indel calling correctly; UnifiedGenotyper somehow re-ran with default settings. I am going to re-run for indels and see how many it detects.
