
# (howto) Recalibrate base quality scores = run BQSR


#### Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.


#### Steps

1. Analyze patterns of covariation in the sequence dataset
2. Do a second pass to analyze covariation remaining after recalibration
3. Generate before/after plots
4. Apply the recalibration to your sequence data

### 1. Analyze patterns of covariation in the sequence dataset

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-I realigned_reads.bam \
-L 20 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-o recal_data.table


#### Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.

### 2. Do a second pass to analyze covariation remaining after recalibration

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-I realigned_reads.bam \
-L 20 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-BQSR recal_data.table \
-o post_recal_data.table


#### Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.

### 3. Generate before/after plots

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T AnalyzeCovariates \
-R reference.fa \
-L 20 \
-before recal_data.table \
-after post_recal_data.table \
-plots recalibration_plots.pdf


#### Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.

### 4. Apply the recalibration to your sequence data

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R reference.fa \
-I realigned_reads.bam \
-L 20 \
-BQSR recal_data.table \
-o recal_reads.bam


#### Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag --emit_original_quals to the PrintReads command, in which case the original qualities will also be written to the file, tagged OQ.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.
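For example, to keep the original qualities in the output (written under the OQ tag as described above), the step 4 command becomes:

java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R reference.fa \
-I realigned_reads.bam \
-L 20 \
-BQSR recal_data.table \
--emit_original_quals \
-o recal_reads.bam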


Geraldine Van der Auwera, PhD



Hi,

I am trying to generate before/after plots using the following GATK command:

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human_g1k_v37.fasta -before recal_data_1.table -after post_recal_data_1.table -plots recalibration_plots_1.pdf

##### ERROR (error output truncated in the original post)

ggplot2 is installed in R (version 2.15.1)

Do you have gsalib installed?
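A quick way to check from the shell; library() exits with an error if the package is missing:

$ Rscript -e 'library(gsalib)'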

Geraldine Van der Auwera, PhD


Yes, I have installed gsalib

Try running the BQSR.R script on its own, that should give you a more informative error message about what is wrong.
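A sketch of the same idea using the engine's debug logging (the -l DEBUG flag referenced in GATK's own error messages), which usually surfaces the underlying Rscript error:

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -l DEBUG \
-R human_g1k_v37.fasta \
-before recal_data_1.table \
-after post_recal_data_1.table \
-plots recalibration_plots_1.pdf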

Geraldine Van der Auwera, PhD


Hi,

I ran into some problems when trying to follow this tutorial. The first was when performing step 2 - it said that it needed a set of known sites to be able to carry out the command, specifically requesting a vcf file be provided. Did I possibly do something wrong for that step?

The second problem I ran into was when I tried to create the plots in step 3: it said that a file of reads must not be provided. After I eliminated the -I argument, it created the pdf (though my plots do not look particularly good, so I worry about whether the mask file I provided is any good).

Cheers,
Gavin

Hi Gavin,

Both problems were my fault -- I forgot to include the known sets in the command line for step 2, and shouldn't have included the bam file in step 3. It's fixed now.

Geraldine Van der Auwera, PhD


Since step 4 uses recal_data.table instead of post_recal_data.table, is there any difference in the output of step 4 (recal_reads.bam) if steps 2 and 3 are omitted?

@vipingjo, skipping steps 2 and 3 does not affect the recalibration itself, it just means that you won't have the before/after plots to evaluate how it went.
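In other words, the minimal path is step 1 followed directly by step 4, using the same commands as in the tutorial above with the plotting steps left out:

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
-R reference.fa -I realigned_reads.bam -L 20 \
-knownSites dbsnp.vcf -knownSites gold_indels.vcf \
-o recal_data.table

java -jar GenomeAnalysisTK.jar -T PrintReads \
-R reference.fa -I realigned_reads.bam -L 20 \
-BQSR recal_data.table -o recal_reads.bam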

Geraldine Van der Auwera, PhD


I am (finally) running GATK version 2.6-5 and a freshly installed R, on Mac OS X Lion.
I also had trouble getting the plots from the AnalyzeCovariates command, getting errors from Rscript.
Using the "-l DEBUG" option, the detailed errors indicated missing R packages.
I fixed it by going into R and

install.packages("gplots")
install.packages("reshape")


Enjoy.


How to troubleshoot making plots on a Mac if gsalib isn't working.

In Terminal:

Go to the gatk-master directory. (cd gatk-master).

$ ant gsalib

If that doesn't work ("BUILD FAILED"), then:

$ ant clean dist

You should get "BUILD SUCCESSFUL" here.

In R

install.packages("bitops")

install.packages("caTools")

install.packages("colorspace")

install.packages("gtools")

install.packages("gplots")

install.packages("gdata")

install.packages("ggplot2")

install.packages("RColorBrewer")

install.packages("reshape")

install.packages("png")

In Terminal

Go to gatk-master. (cd gatk-master).

$ R CMD INSTALL public/R/src/org/broadinstitute/sting/utils/R/gsalib

You should get "DONE (gsalib)" here. You're ready to run your command:

$ java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R reference.fasta -before before.grp -after after.grp -plots plots.pdf

A pdf should be created. Good luck!

Potential search terms:

ERROR: dependencies ‘gplots’, ‘png’ are not available for package ‘gsalib’

Error: ERROR: no packages specified

ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info

Running R externally.

Hi @Baskine, thanks for posting this run-through. FYI though, gsalib is now available in CRAN, so it's no longer necessary to compile GATK from source to get it. Hopefully this will make it easier for our users who are less comfortable with that sort of thing.
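For later readers, the CRAN route is a shell one-liner (a sketch; any CRAN mirror works):

$ Rscript -e 'install.packages("gsalib", repos = "https://cloud.r-project.org")'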

Geraldine Van der Auwera, PhD


You're welcome!

I did it this way because >install.packages("gsalib") in R wasn't successful for me (I assume this is what "gsalib is now available in CRAN" means?), even though the package itself looked like it installed successfully. I would get the error message "RScript exited with 1. Run with -l DEBUG for more info" when I ran AnalyzeCovariates.

Yes, that's what I meant. It seems that in some cases installing the library from command line doesn't make it accessible when the Rscript gets run by GATK. It's an R environment setup issue, I think.

Geraldine Van der Auwera, PhD


Sorry for the stupid question, but what does -L 20 mean in the example? As I understand it, the -L argument defines regions... Thank you

Hi @Eugenie, -L 20 means "run this analysis on the contig called '20'". For humans (the b37 reference build) that is chromosome 20. In the hg19 reference it is called 'chr20', which is admittedly more explicit. If you're using a different organism or reference, the contigs may be named using different conventions.
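For example, the same interval under the two naming schemes:

-L 20      # b37: contigs are named 1, 2, ..., 22, X, Y, MT
-L chr20   # hg19: contigs are named chr1, chr2, ..., chrX, chrY, chrM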

Geraldine Van der Auwera, PhD


Dear Geraldine, thank you so much for the explanation. And if I run it on exome data, is it worth specifying regions (a bed file)? Thank you for your help

Yes, for exome data you should provide the capture regions file. FYI our standard practice is to pad the intervals with +/-50 bp. You can do it by modifying the intervals in your file or you can use the interval padding argument in your command line.
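For example, a sketch of step 1 with a capture file and the engine-level padding argument (in GATK 3 it is --interval_padding, short form -ip; file names are placeholders):

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-I realigned_reads.bam \
-L capture_targets.bed \
--interval_padding 50 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-o recal_data.table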

Geraldine Van der Auwera, PhD


Dear Geraldine, thank you very much for such an important note. I realize now that running BQSR and realignment without padding was really stupid... But I suppose the interval padding depends on the read length? Probably 50 bp is enough for my Illumina 100 bp reads ;)

I have another issue which is maybe a bit off topic: I ran into an incompatibility between the contigs of my data and of the known indels. I know it's a very common problem and I searched the related threads, but usually they are about b37/hg19 incompatibility. In my case it's hg19 throughout, but the alignment was done with a reference containing only the "canonical" contigs: chr1, 2... M. And the known indels from the bundle include patches, as I understand. So now I wonder if it's better to edit the indels vcf or to rerun the alignment with the patched reference version.
And the question is: can you suggest where to look for information about using the patched reference version vs the old one? I just can't decide whether it's worth it, as the realignment will take quite a long time.
Maybe there were some discussions I failed to find...

Thank you very much for your support and sorry for all this mess

Hi @Eugenie,

It's not stupid, I realize now we haven't really documented that anywhere. I will write some documentation about working with exomes, hopefully that will help. We use 50 bp on Illumina 100 bp reads, so you're good to go with that.

You do need to make sure you use matching resources and reference. The known indels in our bundle match the reference we also provide in the bundle, so you should be able to check whether it's the same one you use. The extra contigs won't matter, if I recall correctly.

Geraldine Van der Auwera, PhD



Hello Geraldine,

I meant it seems quite reasonable once mentioned, but not obvious before)
But you already did such a great job, thank you!
It's just not so easy to come across related discussions.

But now I see you already mentioned it in some topics. And some more related questions have appeared. You said:

... It is indeed better (though not necessary) to use the padded intervals file for base recalibration. For indel realignment it doesn't matter.
...we don't use intervals until the PrintReads step (after Base Recalibration).

(Sorry I don't know how to link posts...)

Please can you help me clarify this completely: is your practice to run realignment on targeted data (e.g. exomes) without intervals and to include them only in the final step of BQSR?
And in the case of BQSR, is it more because of running time, or can there be other reasons?

Yes, you were right: only the order and not the number of contigs matters - putting chrM at the end solved the problem, thank you)

Sorry for lots of dummy questions and thank you for your patience!

No problem, I'm here to help. For realignment, it doesn't really matter to the final results if you use intervals or not, because at worst you realign some areas outside the targets for nothing. So it's better to use them (at the RealignerTargetCreator step) to save processing time, but if you don't, your results will still be ok. For recalibration it's more important, because if you have some off-target coverage, those areas will typically have lower mapping quality and higher rate of mismatches that do not represent real variation. So if you don't use intervals, they will be included in the recalibration and will mess up the signal a little, which can reduce the quality of recalibration results. Make sense?
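A sketch of where the intervals end up under this advice for a targeted run (file names are placeholders; RealignerTargetCreator takes its known indels via -known):

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
-R reference.fa -I dedup_reads.bam \
-L capture_targets.bed --interval_padding 50 \
-known gold_indels.vcf \
-o realign_targets.intervals

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
-R reference.fa -I realigned_reads.bam \
-L capture_targets.bed --interval_padding 50 \
-knownSites dbsnp.vcf -knownSites gold_indels.vcf \
-o recal_data.table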

Geraldine Van der Auwera, PhD


Yes, definitely;) thanks a lot!
My concern was whether I would miss reads that overlap the target only partially when running realignment with intervals (even with 50 bp padding) - I thought maybe that's why you intentionally don't use them at this step... but it seems that's not the case? Thank you again for the thorough explanation!

You're welcome, I'm glad it helps. Note that the GATK tools typically use all reads that overlap a target of interest; even if part of the reads are "hanging over the edge", they will still get included in the analysis.

Geraldine Van der Auwera, PhD


Thank you, Geraldine! It was not obvious (to me) and you really reassured me;)
But please allow me to ask for another piece of advice. It was mentioned that even for cancer samples the difference from the reference is not high enough to spoil the recalibration. But what if the aim is to call somatic mutations in a cancer with a very high mutation rate (say 10/Mb)? Is there a risk of underestimating the quality of the somatic mutation calls this way? I see that it's very case-specific, but do you usually apply this step in such cases as well? Thank you for your help!

Hmm, this is not something we have any experience with. Maybe ask in the MuTect forum what their recommendation would be?

Geraldine Van der Auwera, PhD


Hello Geraldine,

I am running GATK for the first time and I am at the step of analyzing the patterns of covariation, using my own exome data (from humans). I have come up with a couple of questions when trying to use the known variants databases provided in the bundle (v2.8/b37). There are at least two different versions of dbsnp (dbsnp_138.b37.vcf and dbsnp_138.b37.excluding_sites_after_129.vcf); what is the difference between these two?

Hi @vifehe‌,

The dbsnp_138.b37.vcf file contains all sites in dbsnp build 138 without modification. The dbsnp_138.b37.excluding_sites_after_129.vcf is based on the first file, but we've removed the sites that were added to dbsnp since the release of build 129. You can use either one as known sites. If I recall correctly, the reason we maintain two different versions is that we had certain benchmark metrics based on the 129 callset and wanted to keep that available, but with updated annotations where applicable.
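Either file can be dropped into step 1 unchanged, for example (reference and bundle file names as discussed in this thread; the rest of the command as in the tutorial above):

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
-R human_g1k_v37.fasta -I realigned_reads.bam \
-knownSites dbsnp_138.b37.vcf \
-o recal_data.table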

Geraldine Van der Auwera, PhD


Hi Geraldine,

Maybe I'm doing something wrong but... when I apply the recalibration to my input bam (using PrintReads), the size of the output bam is 34M, while the input is 22M. Is this OK? I thought the size "should be" more or less the same, because we are only changing the quality values... It's my first time with GATK, so forgive me if I'm asking an obvious question.

Hi @Irantzu‌,

Don't worry, that's expected. The recalibration adds insertion and deletion qualities (like base call quals but for indels) so the file size can increase quite a bit.
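You can inspect the extra per-base qualities directly (a sketch; assumes samtools is installed, and that the indel qualities are stored in the BI/BD tags as in GATK 3, with original qualities under OQ when emitted):

$ samtools view recal_reads.bam | head -n 1 | tr '\t' '\n' | grep -E '^(BI|BD|OQ):'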

Geraldine Van der Auwera, PhD


I am designing a pipeline to detect predominantly novel mutations across 14 genes (i.e. targeted/enriched, not whole exome). In this case, where I'm looking for novel variants, is it recommended that the BQSR step be skipped? I'm concerned that the quality scores for such variants would be decreased.

Hi,

Yes, you should skip BQSR; a panel of only 14 genes yields too few bases to build a reliable recalibration model.

-Sheila


@Baskine thanks a lot! After installing the R packages it works properly! I was suffering at this step!


When I use PrintReads... which file must I use with -BQSR, the before or the after one? In the tutorial the first one appears, but I want to confirm...

The "before" file. The "after" one is only used for checking results.

Geraldine Van der Auwera, PhD


Hi @Geraldine_VdAuwera,
can you explain why the recalibrated bam files are almost twice as big as the original bam file?
E.g. for an input file realigned_reads.bam of size 4.1G, I get after the PrintReads command an output bam file "recal_reads.bam" of size 7.4G.
Why is that?
Thanks,
Yaron

Hi Yaron,

The new file contains the indel quality scores which take up a lot of space. Also included are original quality tags which contribute to the large file size.

-Sheila


Thanks @Sheila ,
Just to make sure:
1. Before the recalibration step ("PrintReads"), indels don't have any quality scores?
2. I think the default behavior of the PrintReads command is NOT to keep the original qualities, in order to save space; only if you add the flag "--emit_original_quals" will it keep the old qualities.

Hi,

1) You are correct that indels do not have quality scores before the recalibration step.

2) Yes, I was uncertain of the default behavior. Thanks for the correction. Have a look at this thread that may help as well: http://gatkforums.broadinstitute.org/discussion/1823/large-bam-file-sizes-after-base-recalibration

-Sheila


I have a question concerning BQSR for bases involved in insertions and deletions.
I wonder why in the default BQSR command the options --insertions_default_quality and --deletions_default_quality are set to 45: this means that the quality scores of all such bases will be replaced by 45 for the calculation of quality covariates. For comparison, --mismatches_default_quality is set to -1, meaning that this option is turned off (the actual base quality given by the raw reads file is taken into account).
Accordingly, when plots are generated, a single magenta dot for insertions/deletions (before BQSR) is present at Q45 on the "reported quality" axis. So these plots don't represent the diversity of reported qualities at indel sites in the original read file. It is then difficult to assess how much BQSR improved the data for those events.

To test this parameter, I ran a BQSR analysis by turning off this option as follows:

java -jar $GATK -T BaseRecalibrator \
-R myref.fa -I myBAMsort-RG-dedup-realign.bam -knownSites myDB.vcf \
--deletions_default_quality -1 --insertions_default_quality -1 \
-o BQSR_indelquality.table

java -jar $GATK -T BaseRecalibrator \
-R myref.fa -I myBAMsort-RG-dedup-realign.bam -knownSites myDB.vcf \
--deletions_default_quality -1 --insertions_default_quality -1 \
-BQSR BQSR_indelquality.table \
-o post-BQSR_indelquality.table

java -jar $GATK -T AnalyzeCovariates \
-R myref.fa \
-before BQSR_indelquality.table \
-after post-BQSR_indelquality.table \
-plots plotsBQSR_indelquality.pdf

But this doesn't change anything: the plots are the same, and the tables on page 5 indicate that all "I" and "D" CIGAR events are still grouped in a Q45 bin.

What is the rationale for such default parameters in default BQSR?
Why has my test not changed these parameters?

Thank you very much for your feedback,
Fabrice


Hi again,

I'm attaching the 2 reports generated by BQSR: one with default parameters and the other with --insertions/deletions_default_quality turned off.

Hi @FabriceBesnard‌,

The reason we need to assign a default value to indel qualities is because there are none in the data before you run BQSR. Using a default allows us to have a value to work with, which we can then increase or decrease according to the covariation patterns observed in the data. In contrast, there are base qualities in the original data, so we don't need to set a default value for them.

Geraldine Van der Auwera, PhD


Hi Geraldine.
I think that in your last sentence you meant: "In contrast, there are base qualities [FOR MISMATCHES] in the original data, so we don't need to set a default value for them". Am I right?
I am still not sure I understand everything, so let me ask a few more questions (at the risk of looking a bit stupid!), so you can more easily track where my thinking goes wrong!

For every single base of the reads coming from the sequencer in fastq format, there is a "reported" quality, whatever the match with the reference genome, be it a mismatch, a deletion or an insertion. So I don't follow you when you say there are no "indel qualities" in the data before BQSR.
For a deletion, actual reads will have missing bases compared to the reference genome. Since the corresponding reference bases don't exist in the reads, of course they don't have a reported quality score, and BQSR cannot compute an empirical quality score either. So now I am wondering which bases are used to draw the BQSR deletion plot: the reference bases missing in the reads? The bases at the border of the deleted region (present in my reads)? Or actual bases that are present in a subset of the reads but absent in the rest (suggesting a deletion)? In that latter case, I don't see why the base qualities of the reads without the deletion would not be taken into account.
For an insertion, the corresponding bases exist in the read, so they have a "reported" base quality. So here in particular, I don't understand the need for setting a default value for the insertion base qualities.

I would be happy and grateful if you could explain what I am misunderstanding here!

Note: In addition, I would like to share my experience of BQSR with an organism for which no dbSNP or known-indels resources are available (even if I am not sure this is the best place to do it). As explained in the GATK documentation, I ran several rounds of BQSR, feeding it with the results of the previous variant calling, until convergence.
Before, I was using UnifiedGenotyper and I needed 2 to 3 rounds of BQSR/variant calling to reach convergence in the number of calls.
Now I use HaplotypeCaller and only one round is necessary (the number of calls stays the same). That's great!
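For reference, one such bootstrap round looks roughly like this (a sketch with placeholder file names: call variants without known sites, feed the calls back in as known sites, and repeat until the call count converges):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
-R myref.fa -I myreads.bam -o bootstrap_calls.vcf

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
-R myref.fa -I myreads.bam \
-knownSites bootstrap_calls.vcf \
-o bootstrap_recal.table

java -jar GenomeAnalysisTK.jar -T PrintReads \
-R myref.fa -I myreads.bam \
-BQSR bootstrap_recal.table -o myreads.recal.bam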

Hi @FabriceBesnard‌,

"In contrast, there are base qualities" [FOR MISMATCHES] "in the original data

Your interpretation is correct, that's what I meant. Sorry for not being more explicit.

The difficulty here is that base qualities can't be used as directly for indels as for SNPs. This is most obvious for deletions: a deletion is manifested as the absence of certain bases in your reads. By definition, there are no quality calls for the missing bases, only for the neighboring ones. So initially, you have no confidence estimation for the indels. For insertions the difference compared to SNPs is not quite as intuitive, because you do have bases with qualities, but since they do not align to the reference individually, their qualities are not directly usable as an estimator of confidence by the BQSR. So that's why the BQSR needs to assign default values, which it then adjusts based on the trends it sees in the data. The way this is done involves statistical modeling over a window around the site of the event, utilizing context information such as the qualities of neighboring bases.

Thanks for your positive feedback about HaplotypeCaller! It's interesting that it also helps with BQSR bootstrapping; I would not have predicted that but it makes a lot of sense.

Geraldine Van der Auwera, PhD


Thank you for your explanations. I think I get it now. But one final question: why are those options (--insertions_default_quality and --deletions_default_quality) available if you cannot apply them to a data set?

Concerning BQSR bootstrapping, I should point out that the reduction in the number of variants is small: less than 1% of the initial calls are usually dropped in the second round of calling after bootstrapping.

@FabriceBesnard I'm not sure what you mean by "cannot apply them to a data set". If you pass a valid value the program should use that as default quality. What you're observing is just that it's not possible to turn them off (since they are required for the program's operation).
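For instance, passing a valid (non-negative) value changes the default rather than disabling it; a sketch on the same placeholder files as above:

java -jar $GATK -T BaseRecalibrator \
-R myref.fa -I myBAMsort-RG-dedup-realign.bam -knownSites myDB.vcf \
--insertions_default_quality 30 --deletions_default_quality 30 \
-o BQSR_default30.table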

Geraldine Van der Auwera, PhD


Ok, I get it now. Thank you very much for those explanations !


Cross-posting from the BQSR forum section, since I have not gotten an answer yet.
Hi, this is my first try with GATK. After aligning with bwa to the ucsc_hg19 reference and de-duplicating with Picard, I realigned the bam file and started the base recalibration process before moving on to SNP calling (as indicated in the Best Practices). I ran everything with default options and generated the BQSR report with RStudio and a downloaded BQSR.R script (to circumvent an Rscript PATH issue). I provided the BQSR.R script with the recal.table from the first BaseRecalibrator pass and the csv file from AnalyzeCovariates run after a second pass, as described in the "how to recalibrate base quality" tutorial, but my feeling is that the quality values are worse after recalibration. At first I thought it was due to downsampling, but after running the whole process on the entire bam file I got similar results (see attached file). I'm using the (I think) latest GATK version, v3.3-0-g37228af. Any help on how to interpret my recalibration plots?

Thanks a lot, Fabrice.