
# (howto) Recalibrate base quality scores = run BQSR

edited July 2013

#### Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

• TBD

#### Steps

1. Analyze patterns of covariation in the sequence dataset
2. Do a second pass to analyze covariation remaining after recalibration
3. Generate before/after plots
4. Apply the recalibration to your sequence data

### 1. Analyze patterns of covariation in the sequence dataset

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-I input_reads.bam \
-L 20 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-o recal_data.table


#### Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.
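The recalibration tables boil down to comparing each bin's reported quality against an empirical quality derived from observed mismatches, with mismatches at the known sites excluded, since those are expected to differ from the reference. As a rough illustration of the Phred arithmetic only, not of GATK's actual Bayesian estimation, a bin's empirical quality could be computed like this (`empirical_quality` is a hypothetical helper):

```python
import math

def empirical_quality(mismatches, observations):
    """Phred-scaled empirical quality for a bin of base observations.

    Mismatches at known variant sites are assumed to have been excluded
    upstream, so any remaining mismatch is treated as a sequencing error.
    """
    if observations == 0:
        raise ValueError("no observations in this bin")
    # Floor the count at 1 to avoid an infinite quality when no
    # mismatches were seen in the bin.
    error_rate = max(mismatches, 1) / observations
    return -10.0 * math.log10(error_rate)

# A bin with 10 errors in 10,000 observed bases works out to Q30.
print(round(empirical_quality(10, 10_000)))  # 30
```

So a bin reported at, say, Q35 but showing 10 mismatches in 10,000 observations would be pulled down toward Q30 by the recalibration.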

### 2. Do a second pass to analyze covariation remaining after recalibration

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-I input_reads.bam \
-L 20 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-BQSR recal_data.table \
-o post_recal_data.table


#### Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.

### 3. Generate before/after plots

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T AnalyzeCovariates \
-R reference.fa \
-L 20 \
-before recal_data.table \
-after post_recal_data.table \
-plots recalibration_plots.pdf


#### Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.

### 4. Apply the recalibration to your sequence data

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R reference.fa \
-I input_reads.bam \
-L 20 \
-BQSR recal_data.table \
-o recal_reads.bam


#### Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag --emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.
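For example, once the recalibrated BAM is written with the original qualities retained, each read record carries its new qualities in the regular QUAL column and the old ones in an OQ:Z: optional tag. A minimal sketch of extracting that tag from a SAM-formatted record (the read line shown is hypothetical):

```python
def get_oq(sam_line):
    """Return the original base qualities stored in the OQ:Z: tag, or None."""
    # Optional tags start at field 12 of a SAM record (index 11),
    # formatted as TAG:TYPE:VALUE.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("OQ:Z:"):
            return field[5:]
    return None

# Hypothetical read: recalibrated quals in column 11, originals in OQ.
read = "read1\t0\t20\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tOQ:Z:HHHH"
print(get_oq(read))  # HHHH
```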


Geraldine Van der Auwera, PhD


• Posts: 13 · Member

Hi,

I am trying to generate before/after plots using the following GATK command:

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human_g1k_v37.fasta -before recal_data_1.table -after post_recal_data_1.table -plots recalibration_plots_1.pdf

##### ERROR ------------------------------------------------------------------------------------------

ggplot2 is installed in R (version 2.15.1)

Do you have gsalib installed?

Geraldine Van der Auwera, PhD

• Posts: 13 · Member

Yes, I have installed gsalib

Try running the BQSR.R script on its own; that should give you a more informative error message about what is wrong.

Geraldine Van der Auwera, PhD

• Stanford · Posts: 6 · Member

Hi,

I ran into some problems when trying to follow this tutorial. The first was when performing step 2 - it said that it needed a set of known sites to be able to carry out the command, specifically requesting a vcf file be provided. Did I possibly do something wrong for that step?

The second problem I ran into was when I tried to create the plots in step 3: it said that a file of reads must not be provided. After I eliminated the -I argument, it created the pdf (though my plots do not look particularly good, so I do worry about whether the mask file I provided is any good).

Cheers, Gavin

Hi Gavin,

Both problems were my fault -- I forgot to include the known sets in the command line for step 2, and shouldn't have included the bam file in step 3. It's fixed now.

Geraldine Van der Auwera, PhD

• Posts: 0 · Member

Since step 4 uses recal_data.table instead of post_recal_data.table, is there any difference in the output of step 4 (recal_reads.bam) if steps 2 and 3 are omitted?

@vipingjo, skipping steps 2 and 3 does not affect the recalibration itself, it just means that you won't have the before/after plots to evaluate how it went.

Geraldine Van der Auwera, PhD

• Posts: 1 · Member
edited July 2013

I am (finally) running GATK version 2.6-5 and a freshly installed R package, on Mac OS X Lion. I also had trouble getting the plots from the AnalyzeCovariates command, getting errors from Rscript. Using the "-l DEBUG" option, the detailed errors indicated missing R packages. I fixed it by going into R and running:

install.packages("gplots")
install.packages("reshape")


Enjoy.

• Bethesda · Posts: 2 · Member
edited October 2013

How to troubleshoot making plots on Macs if gsalib isn't working.

In Terminal:

Go to the gatk-master directory. (cd gatk-master).

$ ant gsalib

If that doesn't work ("BUILD FAILED"), then:

$ ant clean dist

You should get "BUILD SUCCESSFUL" here.

In R

install.packages("bitops")
install.packages("caTools")
install.packages("colorspace")
install.packages("gtools")
install.packages("gplots")
install.packages("gdata")
install.packages("ggplot2")
install.packages("RColorBrewer")
install.packages("reshape")
install.packages("png")

In Terminal

Go to gatk-master. (cd gatk-master).

$ R CMD INSTALL public/R/src/org/broadinstitute/sting/utils/R/gsalib

You should get "DONE (gsalib)" here. You're ready to run your command:

$ java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R reference.fasta -before before.grp -after after.grp -plots plots.pdf

A pdf should be created. Good luck!

Potential search terms:

ERROR: dependencies ‘gplots’, ‘png’ are not available for package ‘gsalib’

Error: ERROR: no packages specified

ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info

Running R externally.

Hi @Baskine, thanks for posting this run-through. FYI though, gsalib is now available in CRAN, so it's no longer necessary to compile GATK from source to get it. Hopefully this will make it easier for our users who are less comfortable with that sort of thing.

Geraldine Van der Auwera, PhD

• Bethesda · Posts: 2 · Member
edited October 2013

You're welcome!

I did it this way because install.packages("gsalib") in R wasn't successful for me (I assume this is what "gsalib is now available in CRAN" means?), even though the package itself looked like it installed successfully. I would get the error message "RScript exited with 1. Run with -l DEBUG for more info" when I ran AnalyzeCovariates.

Yes, that's what I meant. It seems that in some cases installing the library from command line doesn't make it accessible when the Rscript gets run by GATK. It's an R environment setup issue, I think.

Geraldine Van der Auwera, PhD

• Posts: 13 · Member

Sorry for the stupid question but what does -L 20 mean in the example? As I understand the -L argument defines regions... Thank you

Hi @Eugenie, -L 20 means "run this analysis on the contig called '20'". For humans (the b37 reference build) that is chromosome 20. In the hg19 reference it is called 'chr20', which is admittedly more explicit. If you're using a different organism or reference, the contigs may be named using different conventions.
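If you are not sure what your contigs are called, the names come from the '>' header lines of your reference FASTA (and must match your BAM's sequence dictionary). A quick sketch for listing them, assuming a plain uncompressed FASTA (`list_contigs` is a hypothetical helper):

```python
def list_contigs(fasta_path):
    """Return contig names from a FASTA file's '>' header lines."""
    names = []
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                # The contig name is the first word after '>';
                # anything after it is a free-text description.
                names.append(line[1:].split()[0])
    return names
```

Whatever names this returns (e.g. '20' for b37, 'chr20' for hg19) are the values to pass to -L.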

Geraldine Van der Auwera, PhD

• Posts: 13 · Member

Dear Geraldine, thank you so much for the explanation. And if I run it on exome data, is it worth specifying regions (a BED file)? Thank you for your help

Yes, for exome data you should provide the capture regions file. FYI our standard practice is to pad the intervals with +/-50 bp. You can do it by modifying the intervals in your file or you can use the interval padding argument in your command line.
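For illustration, padding the intervals file yourself amounts to widening each record by 50 bp on each side, clamping at zero; this sketch assumes a minimal three-column BED file (`pad_bed` is a hypothetical helper):

```python
def pad_bed(lines, pad=50):
    """Widen each BED interval by `pad` bp on both sides (clamped at 0)."""
    out = []
    for line in lines:
        chrom, start, end = line.split("\t")[:3]
        # BED is 0-based half-open, so we just shift the two coordinates.
        out.append(f"{chrom}\t{max(0, int(start) - pad)}\t{int(end) + pad}")
    return out

# Hypothetical capture targets on contig 20.
targets = ["20\t100\t200", "20\t30\t60"]
print(pad_bed(targets))  # ['20\t50\t250', '20\t0\t110']
```

Note that padded intervals may overlap; GATK's engine generally merges overlapping intervals for you, but if you pad the file yourself for other tools you may want to merge them as well.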

Geraldine Van der Auwera, PhD

• Posts: 13 · Member

Dear Geraldine, thank you very much for such an important note. I realize now that running BQSR and realignment without padding was really stupid... But I suppose the interval padding depends on the read length? Probably 50 bp is enough for my Illumina 100 bp reads ;)

I have another issue which is maybe a bit off topic: I ran into an incompatibility between the contigs in my data and in the known indels. I know it's a very common problem and I searched the related threads, but usually it's about b37/hg19 incompatibility. In my case it's hg19 throughout, but the alignment was done with a reference containing only the "canonical" contigs: chr1, 2...M, while the known indels from the bundle include patches, as I understand it. So now I wonder whether it's better to edit the indels VCF or rerun the alignment with the patched reference version. And the question is: can you suggest where to look for information about using the patched reference version vs the old one? I just can't tell whether it's worth it, as the realignment will take quite a long time. Maybe there were some discussions I failed to find...

Thank you very much for your support and sorry for all this mess

Hi @Eugenie,

It's not stupid, I realize now we haven't really documented that anywhere. I will write some documentation about working with exomes, hopefully that will help. We use 50 bp on Illumina 100 bp reads, so you're good to go with that.

You do need to make sure you use matching resources and reference. The known indels in our bundle match the reference we provide as well in the bundle. You should be able to check whether it's the same one you use. The extra contigs won't matter if I recall correctly.

Geraldine Van der Auwera, PhD


• Posts: 13 · Member

Hello Geraldine,

I meant it seems quite reasonable once mentioned, but not obvious before) But you already did such a great job, thank you! It's just not so easy to come across related discussions.

But now I see you already mentioned it in some topics, and some more related questions have appeared. You said:

... It is indeed better (though not necessary) to use the padded intervals file for base recalibration. For indel realignment it doesn't matter. ...we don't use intervals until the PrintReads step (after Base Recalibration). (Sorry I don't know how to link posts...)

Can you please help me clarify this completely: is your practice to run realignment on targeted data (e.g. exomes) without intervals and to include them only at the final BQSR step? And in the case of BQSR, is it more because of running time, or are there other reasons?

Yes, you were right: only the order and not the number of contigs matters - putting chrM at the end solved the problem, thank you)

Sorry for lots of dummy questions and thank you for your patience!

No problem, I'm here to help. For realignment, it doesn't really matter to the final results if you use intervals or not, because at worst you realign some areas outside the targets for nothing. So it's better to use them (at the RealignerTargetCreator step) to save processing time, but if you don't, your results will still be ok. For recalibration it's more important, because if you have some off-target coverage, those areas will typically have lower mapping quality and higher rate of mismatches that do not represent real variation. So if you don't use intervals, they will be included in the recalibration and will mess up the signal a little, which can reduce the quality of recalibration results. Make sense?

Geraldine Van der Auwera, PhD

• Posts: 13 · Member

Yes, definitely ;) thanks a lot! My concern was whether I'd miss reads that overlap the target only partially when running realignment with intervals (even with 50 bp padding) - I thought maybe that's why you intentionally don't use them at this step... but it seems that's not the case? Thank you again for the thorough explanation!

You're welcome, I'm glad it helps. Note that the GATK tools typically use all reads that overlap a target of interest; even if part of the reads are "hanging over the edge", they will still get included in the analysis.

Geraldine Van der Auwera, PhD

• Posts: 13 · Member

Thank you, Geraldine! It was not obvious (for me) and you really reassured me ;) But please allow me to ask for another piece of advice. It was mentioned that even for cancer samples the difference from the reference is not high enough to spoil the recalibration. But what if the aim is to call somatic mutations in a cancer with a very high mutation rate (say 10/Mb)? Is there a risk of underestimating the quality of somatic mutation calls this way? I see that it's very case-specific, but do you usually apply this step in such cases as well? Thank you for your help!

Hmm, this is not something we have any experience with. Maybe ask in the MuTect forum what would be their recommendation?

Geraldine Van der Auwera, PhD

• Spain · Posts: 10 · Member

Hello Geraldine,

I am running GATK for the first time and I am at the step of analyzing the patterns of covariation using my own exome data (from humans). I have come up with a couple of questions when trying to use the known variants databases provided in the bundle (v2.8/b37). There are at least two different versions of dbsnp (dbsnp_138.b37.vcf and dbsnp_138.b37.excluding_sites_after_129.vcf); what is the difference between these two? Thanks in advance.

Hi @vifehe‌,

The dbsnp_138.b37.vcf file contains all sites in dbsnp build 138 without modification. The dbsnp_138.b37.excluding_sites_after_129.vcf is based on the first file, but we've removed the sites that were added to dbsnp since the release of build 129. You can use either one as known sites. If I recall correctly, the reason we maintain two different versions is that we had certain benchmark metrics based on the 129 callset and wanted to have that available, but with updated annotations where applicable.

Geraldine Van der Auwera, PhD

• Posts: 2 · Member

Hi Geraldine,

Maybe I'm doing something wrong but... when I apply the recalibration to my input BAM (using PrintReads), the size of the output BAM is 34M, while the input is 22M. Is this OK? I thought the size "should be" more or less the same... because we are only changing the quality values... It's my first time with GATK so forgive me if I'm asking an obvious question.

Hi @Irantzu‌,

Don't worry, that's expected. The recalibration adds insertion and deletion qualities (like base call quals but for indels) so the file size can increase quite a bit.

Geraldine Van der Auwera, PhD

• Home · Posts: 1 · Member

I am designing a pipeline to detect predominantly novel mutations across 14 genes (i.e. targeted/enriched, not whole exome). In this case, where I'm looking for novel variants, is it recommended that the BQSR step be skipped? I'm concerned that the quality scores for such variants would be decreased.

Hi,

Yes, you should skip BQSR; only 14 genes provide too few bases to build a reliable recalibration model.

-Sheila

• kamilo889 · Posts: 3 · Member

@Baskine‌ thanks a lot! After installing the R packages it works properly! I was stuck on this step!