The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Did we ask for a bug report?

Then follow instructions in Article#1894.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

##### Jump to another community
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# Too many variants called using GATK

Posts: 2
edited January 2013

Hi, I am working on SOLiD 5500xl data and used SHRiMP2.2.3 for performing mapping. The library type is paired-end. I have read some discussions regarding SOLiD problem but I still have some doubts regarding some steps in best practices

1. Local-Realignment at In-dels: Since local realignments take place in base space instead of color-space I doubt the accuracy of the alignment
2. Mark/Remove Duplicates: Reads just lying in the same position (end to end) may not necessarily be duplicates. Some of these reads may have putative variants, which otherwise may be filtered out.
3. Base quality score recalibration: I am not sure whether this is applicable for 5500xl as well, since quality values have slightly changed on 5500 from previous SOLiD versions as far as I know.

So after mapping, I simply used GATK UnifiedGenotyper to call SNPs and InDels under default values. I end up getting around 40 million variants. Is there any way I can get a more refined variant calling? Do you still consider me applying the above pre-processing steps or do you recommend me applying some variant filteratiion on the called variants? If yes for the previous, then could you explain how my above concerns are taken care of? I was trying to look at some general recommended filter values on INFO fields in VCF format such as BQ, MQ, MQ0, DP, SB etc. Do you recommend some generally used values of these fields on which I can filter and hence refine my variant data?

I may have posted a subset of the above question, which I am not sure was posted successfully since at that time I just created an account. If you have already answered this question then I apologize for that. Could you then provide me the link where you answered it?

Thanks in advance

Post edited by Geraldine_VdAuwera on
Tagged:

## Best Answer

• Posts: 49
Accepted Answer

Hey there!

First what experiment do you have? species (human..), how many samples, exome/genome? what reference did you use?
40 million variants sure seems like a lot...

Question 1:
IndelRealigner performs great on SOLiD data (LifeScope, BioScope, BWA mapped) and greatly reduces false positive InDel calls which are mainly caused by the "stupid" mapping software which only looks at one read and the reference.

Question 2:
From Picards markduplicates documentation as I understand it a read would only be marked if it had the same sequence and also the same start, end etc.. Also we did some evaluation work on this and never encountered a true variat that was lost because of marking duplicates.

Question 3:
The Problem with the base qualities from mapping software is that it mostly relies on some internal metrics and does not represent the actual probability of mismatching the reference genome. The BQSR is developed and tested for many platfoms including SOLiD (gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibrator).

## Answers

• Posts: 49
Accepted Answer

Hey there!

First what experiment do you have? species (human..), how many samples, exome/genome? what reference did you use?
40 million variants sure seems like a lot...

Question 1:
IndelRealigner performs great on SOLiD data (LifeScope, BioScope, BWA mapped) and greatly reduces false positive InDel calls which are mainly caused by the "stupid" mapping software which only looks at one read and the reference.

Question 2:
From Picards markduplicates documentation as I understand it a read would only be marked if it had the same sequence and also the same start, end etc.. Also we did some evaluation work on this and never encountered a true variat that was lost because of marking duplicates.

Question 3:
The Problem with the base qualities from mapping software is that it mostly relies on some internal metrics and does not represent the actual probability of mismatching the reference genome. The BQSR is developed and tested for many platfoms including SOLiD (gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibrator).

• Posts: 2
edited November 2012

Hi BerntPopp, Thanks for answering to my questions. Many of my doubts are clear but some still remain. My sample was human whole exome. I mapped it against hg19 (UCSC) genome downloaded from Broad Institute repository. Not only SNPs, but I have been getting around 40 million InDels, which is again too liberal according to my experience with whole exome projects on human. Following are my follow-up questions:

1. Regarding Markduplicates, after reading their FAQ, though they claim to take care of clipping, I would like to double check with them on how do they find duplicate reads from local alignment. I would appreciate of you could share the information with me, but for now, it seems like it would be safer to use MarkDuplicates on global alignment output since in the latter case you have complete read information in SAM output.

2. Regarding Base Quality score recalibration, as far I understood from reading Best Practices, is that recalibration takes place on the base quality given by the sequencer. In case of SOLiD, since sequencer gives the color quality (quality of di-base) instead of base, the mapper SHRiMP2 converts the color quality to base quality through forward-backward algorithm to fill in the QUAL field. There is another quality score (MAPQ) that the mapper gives on mapping. So when you said " it(mapper) mostly relies on some internal metrics and does not represent the actual probability of mismatching the reference genome" are you talking about MAPQ or just QUAL? Does BQSR recalibrate MAPQ value as well? I just wanted to make sure.

Thank you very much!

• Posts: 2
edited November 2012

Correction to my above comment. Sorry I did not run Indels correctly. Unifield Genotyper somehow re-ran with default settings. I am going to re-run for InDels and see how many it detects.

Sign In or Register to comment.