The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# (howto) Recalibrate base quality scores = run BQSR

edited March 2

#### Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

• TBD

#### Steps

1. Analyze patterns of covariation in the sequence dataset
2. Do a second pass to analyze covariation remaining after recalibration
3. Generate before/after plots
4. Apply the recalibration to your sequence data

### 1. Analyze patterns of covariation in the sequence dataset

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-L 20 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-o recal_data.table


#### Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.

Note that -L 20 is used here and in the next steps to restrict analysis to only chromosome 20 in the b37 human genome reference build. To run against a different reference, you may need to change the name of the contig according to the nomenclature used in your reference.

### 2. Do a second pass to analyze covariation remaining after recalibration

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-L 20 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-BQSR recal_data.table \
-o post_recal_data.table


#### Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.

### 3. Generate before/after plots

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T AnalyzeCovariates \
-R reference.fa \
-L 20 \
-before recal_data.table \
-after post_recal_data.table \
-plots recalibration_plots.pdf


#### Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.

### 4. Apply the recalibration to your sequence data

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-R reference.fa \
-L 20 \
-BQSR recal_data.table \


#### Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag –emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.

Geraldine Van der Auwera, PhD

Post edited by Sheila on
Tagged:

• FreiburgPosts: 36

It says the expected output is recal_data.grp but I get (and the command suggests) the output would be recal_data.table ?

Ah, that was a mistake in the text. Fixed now. Thanks for pointing it out.

Geraldine Van der Auwera, PhD

• LisbonPosts: 16

Can you please tell me if I have to run "BaseRecalibrator" and create a "recal_data.table" per BAM file, or if I can create only one table using all my bam files as input (each BAM file is from a different sample). Does it make sense?

This way can I use the same "recal_data.table" to do the PrintReads of all BAM files?

Thanks

@armarkes
Hi,

As long as your read groups are properly labeled, you can feed all the bams into BaseRecalibrator at once. This article may help as well.

-Sheila

• LisbonPosts: 16

Thanks,

Ana

• LisbonPosts: 16

Regarding the same issue, I did that question because all my reads were obtained in the same lane. And I have several samples that were sequenced at the same time.
What I really wanted to know is: if I do the recalibration step using all the bam files (each is from different sample) as input, will I get a better recalibration table? Or it will be the same if I do a table per sample?
Each sample has a different read group, because I wanted to distinguish them if they are all merged in a bam file.
Thanks
Ana

@armarkes
Hi Ana,

BQSR should be done on all the reads in one lane. So, if your samples have been sequenced on the same lane, and the read group reflects that, you will be fine. Have a look at this page for more help on read groups.

-Sheila

• LisbonPosts: 16

Since the only difference I have between my BAM files are the DNA samples (because library was the same, and they were sequenced all together in the same lane of the flowcell), should I run BaseRecalibrator in each bam file independently and create a recal.table for each sample and then run PrintReads in each bam file using its recal.table?

If I do this I will get the graphics (AnalyzeCovariates) per sample, using the recal.tables obtained per sample.

However I read somewhere in this forum that we should use the same recal.table to apply to all bam files (in PrintReads tool) from the same flowcell lane. And it makes sense because the sequencing error will be systematic through all samples.
And if I have only one recal.table generated from all my bam files, I will have only one graphic with pre and post recalibartion procedure.

I would like to know which option is recommended by you. Create only one recal.table using all my bam files as input, or create a recal.table per bam file?

Thanks

@armarkes Both ways are valid.

In the simplest case, where we have one library (LB) of one sample (SM) run on one lane, we run recalibration independently on the bam file that contains this read group's data. In reality we usually run multiple other samples or libraries on the same lane (this is called multiplexing) so from a single lane we get multiple read groups. In our production pipeline this produces one bam file for each read group, and we run recalibration independently on each one (all with their own recal tables) even if they came from the same lane. If your pipeline produces a single bam file containing multiple read groups (that are appropriately distinguished) coming from the same lane, you can run base recalibration on that. You can even produce a single recalibration file by running BaseRecalibrator on multiple bams if you want, though it may take more time and memory.

The results will be the same in all cases because internally the processing is done per read group. This allows us to be able to recalibrate for errors that arise at the library level as well as at the lane level.

I think AnalyzeCovariates should produce separate graphics for each read group but I may be wrong, it has been a long time since I tried to run it that way.

Geraldine Van der Auwera, PhD

• LisbonPosts: 16
edited April 2016

Considering the information I read here https://www.broadinstitute.org/gatk/guide/article?id=6472, I created the read groups with BWA and I gave the same @RG ID to all my sam files (because all my fastq data came from only one single-lane flow cell) but I gave different sample names (@RG SM) to distinguish my samples in the @RG header. I had doubts on that so can you please tell me if I did it right?

example: @RG ID:M01600:44:1 SM:AA551 PL:ILLUMINA PI:250

Thanks again.

No that's wrong -- all the RG IDs need to be different. See the GATK dictionary for more details on read groups.

Geraldine Van der Auwera, PhD

• SpainPosts: 16

Hi,

As suggested in the best practice workflow, since I'm not working with humans, I have performed a SNPcalling with restrictive quality thresholds during the haplotype calling so I can used the detected variants as known sites in the BQSR:

for bams in $files do filename=$(basename "$bams" _realg.bam) java -jar$gatk \
-T HaplotypeCaller \
-I $bams \ -R$genome \
--min_base_quality_score 40 \
--min_mapping_quality_score 40 \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-nct 16 \
-R $genome \ -T GenotypeGVCFs \ -V 04N4206.g.vcf \ -V 04N4219.g.vcf \ -V 11004.g.vcf \ -V 11005.g.vcf \ -V 12204.g.vcf \ -V 12M158.g.vcf \ -V 12M159.g.vcf \ -o JUNCO_variants_q40Q40.vcf \ -nt 16 # ---- SELECT VARIANT TOOL java -jar$gatk \
-T SelectVariants \
-V JUNCO_variants_q40Q40.vcf \
-R \$genome \
-o JUNCO_variants_q40Q40_filtered.vcf \
-nt 16 \
-restrictAllelesTo MULTIALLELIC

I get then a really small dataset of ~3800 variants. The problem I'm having is that when I used the resulting dataset to recalibrated the bam files and perform the variant calling once again, I get an empty vcf file. Which can be the problem? Too few variants for the recalibration? Too low quality remaining SNPs?

Thanks a lot!

Guillermo

• KamphaengsaenPosts: 1

I am trying to recalibrate SNPs in a non-model organism. The information on the website tells me that I need a variant / known sites database, but in case that is not available, it recommends "For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation. " I thought this is the GATK online documentation... so where else should I look?? can you put a link to the proper web page in the text?? Thanks in advance

@Hugo_KPS
Hi,

Have a look at this article under "I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs."

-Sheila

• PolandPosts: 15
edited June 2016

##### ERROR MESSAGE: Bad input: The GATK report has no version specified in the header

I tried to use -rf BadCigar as suggested by other post like this

but I still have the same error any suggestion?

@Hugo_KPS That was copy-pasted from a paper manuscript, sorry for the confusion.

Geraldine Van der Auwera, PhD

@medhat
Hi,

-Sheila

@medhat The program is complaining that there is something wrong with the recalibration table file. Have you opened it to see if the contents look ok?

Geraldine Van der Auwera, PhD

• PolandPosts: 15

@Sheila I am using GATK version 3.5
And the Issue as @Geraldine_VdAuwera said was in the .table file not in the .bam file as I understood, I reurn the step to get the .table file and now running the other step , I hope it will finish normally.

Thanks for help

• ParisPosts: 2

This is a very naive question as I'm new in the field.
As discussed all over the documentation, BQSR is essential and cannot be dismissed.
So you should do it whatever variant calling analysis you want to perform.

But how do you get the known polymorphic site database?

I work on a bacteria (Bacillus subtilis) and I don't even know if such database exists.

@blaurent
Hi,

First you should check and find out if indeed there is such a database for your organism. If there is not, please refer to this document under "I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs."

-Sheila

• ParisPosts: 2

@Sheila Thank you for your very quick answer. I'll deep into that. Cheers

-Ben

• SHPosts: 1
edited November 2016

Hi,
It's my first time to use GATK.
What is the detailed mean of -L 20 ?

Thank you!

Post edited by changpeng on

@changpeng
Hi,

-Sheila

• SeattlePosts: 2

I think the answer to @changpeng 's question need to be briefly mentioned at here. Because I go from GATK Best practice, to "Recalibrate Bases", to the tutorial. One should expect a newbie will read it. And since the hg19 bundle on GATK's FTP uses UCSC style, which corresponds to -L chr20 rather than -L 20. So I think this point should also be included in that tutorial. (Otherwise, users will get this Badly formed genome location: Contig '20' does not match any contig in the GATK sequence dictionary derived from the reference. This will frustrate some users.) (Just a suggestion, it's possible that I am just too naive and others may know this.)

The other thing I want to check is, seems that in the latest best practice, IndelRealigner has been removed (because I do not see it). If so, in this tutorial, it should use dedup.bam rather than currently used realigned_reads.bam.

It's the first time I post here in discussion, I find very handy that this editor supports markdown. It would be great if there is a notice like "Markdown supported" below the editor.

Thanks!

#### Issue · Github January 18 by Sheila

Issue Number
1646
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

@RodenLuo You're right; I added a note and generalized the name of the input file. Thanks for reporting this.

We'll see what we can do to highlight the markdown capabilities.

Geraldine Van der Auwera, PhD

• MarylandPosts: 4

@Geraldine_VdAuwera said:
@RodenLuo You're right; I added a note and generalized the name of the input file. Thanks for reporting this.

We'll see what we can do to highlight the markdown capabilities.

Step 2 mentions -I realigned_reads.bam. Is this also supposed to be the same BAM from Step 1 (input_reads.bam`)? Or have I overlooked an input?

@mr_dave
Hi,

Yes, it is supposed to be input_reads.bam. Good catch I just fixed the document.

-Sheila

• LAPosts: 13

Hello,

I have two questions. 1. For the "-knowSites", should the 1000G Indels be included as well. If not, is there a reason to it please? 2. I remember in some old version of GATK, we need to first create intervals before BQSR - is that skipped now? Thank you very much.

@helene

1. 1000G indels are already reported in the dbsnp file.
2. You may be thinking of RealignerTargetCreator, from the realignment step?

Geraldine Van der Auwera, PhD

• LAPosts: 13
1. 1000G indels are already reported in the dbsnp file.
2. You may be thinking of RealignerTargetCreator, from the realignment step?

Thanks so much for the quick reply.