Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

(howto) Recalibrate base quality scores = run BQSR

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin
edited July 2013 in Tutorials

Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

Prerequisites

  • TBD

Steps

  1. Analyze patterns of covariation in the sequence dataset
  2. Do a second pass to analyze covariation remaining after recalibration
  3. Generate before/after plots
  4. Apply the recalibration to your sequence data

1. Analyze patterns of covariation in the sequence dataset

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T BaseRecalibrator \ 
    -R reference.fa \ 
    -I realigned_reads.bam \ 
    -L 20 \ 
    -knownSites dbsnp.vcf \ 
    -knownSites gold_indels.vcf \ 
    -o recal_data.table 

Expected Result

This creates a GATKReport file called recal_data.grp containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.


2. Do a second pass to analyze covariation remaining after recalibration

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T BaseRecalibrator \ 
    -R reference.fa \ 
    -I realigned_reads.bam \ 
    -L 20 \ 
    -knownSites dbsnp.vcf \ 
    -knownSites gold_indels.vcf \ 
    -BQSR recal_data.table \ 
    -o post_recal_data.table 

Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.


3. Generate before/after plots

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T AnalyzeCovariates \ 
    -R reference.fa \ 
    -L 20 \ 
    -before recal_data.table \
    -after post_recal_data.table \
    -plots recalibration_plots.pdf

Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.


4. Apply the recalibration to your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T PrintReads \ 
    -R reference.fa \ 
    -I realigned_reads.bam \ 
    -L 20 \ 
    -BQSR recal_data.table \ 
    -o recal_reads.bam 

Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag –emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • michaelchaomichaelchao Posts: 13Member

    Hi,

    I am trying to generate before/after plots using the following GATK command:

    java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R human_g1k_v37.fasta -before recal_data_1.table -after post_recal_data_1.table -plots recalibration_plots_1.pdf

    and received the following error:

    ERROR ------------------------------------------------------------------------------------------

    ERROR A GATK RUNTIME ERROR has occurred (version 2.6-4-g3e5ff60):
    ERROR
    ERROR Please check the documentation guide to see if this is a known problem
    ERROR If not, please post the error, with stack trace, to the GATK forum
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info.
    ERROR ------------------------------------------------------------------------------------------

    ggplot2 is installed in R (version 2.15.1)

    Thank you for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Do you have gsalib installed?

    Geraldine Van der Auwera, PhD

  • michaelchaomichaelchao Posts: 13Member
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Try running the BQSR.R script on its own, that should give you a more informative error message about what is wrong.

    Geraldine Van der Auwera, PhD

  • Gavin_SherlockGavin_Sherlock StanfordPosts: 6Member

    Hi,

    I ran into some problems when trying to follow this tutorial. The first was when performing step 2 - it said that it needed a set of known sites to be able to carry out the command, specifically requesting a vcf file be provided. Did I possibly do something wrong for that step?

    The second problem I ran into was when I tried to create the plots in step 3, it said that a file of reads must not be provided. After I eliminated the -I argument, it created the pdf. (though my plots do not look particularly good, so I do worry about whether the mask file I provided is any good.

    Cheers, Gavin

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Hi Gavin,

    Both problems were my fault -- I forgot to include the known sets in the command line for step 2, and shouldn't have included the bam file in step 3. It's fixed now.

    Geraldine Van der Auwera, PhD

  • vipingjovipingjo Posts: 0Member

    Since the step 4 use recal_data.table instead of post_recal_data.table, is there any differences for output of step 4(recal_reads.bam) if omitting step 2 and 3?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    @vipingjo, skipping steps 2 and 3 does not affect the recalibration itself, it just means that you won't have the before/after plots to evaluate how it went.

    Geraldine Van der Auwera, PhD

  • symlynxsymlynx Posts: 1Member
    edited July 2013

    I am (finally) running GATK version 2.6-5. and a freshly installed R package, on MacOSX Lion. I also had trouble getting the plots from the AnalyzeCovariants command, getting errors from Rscript. Using the "-l DEBUG" option, the detailed errors indicated missing R packages. I fixed it by going into R and

    install.packages("gplots")  
    install.packages("reshape")
    

    Enjoy.

    Post edited by Geraldine_VdAuwera on
  • BaskineBaskine BethesdaPosts: 2Member
    edited October 2013

    How to troubleshoot making plots for macs if gsalib isn't working.

    Download R (http://www.youtube.com/watch?v=ICGkG7Gg6j0).

    Go here: https://github.com/broadgsa/gatk, click download zip in the bottom right corner, unzip.

    In Terminal:

    Go to the gatk-master directory. (cd gatk-master).

    $ant gsalib

    If that doesn’t work ("BUILD FAILED") then...

    $ant clean dist

    You should get "BUILD SUCCESSFUL" here.

    In R

    install.packages("bitops")

    install.packages("caTools")

    install.packages("colorspace")

    install.packages("gtools")

    install.packages("gplots")

    install.packages("gdata")

    install.packages("ggplot2")

    install.packages("RColorBrewer")

    install.packages("reshape")

    install.packages("png")

    In Terminal

    Go to gatk-master. (cd gatk-master).

    $R CMD INSTALL public/R/src/org/broadinstitute/sting/utils/R/gsalib

    You should get "DONE (gsalib)" here.

    You’re ready to run your command.

    $java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R reference.fasta -before before.grp -after after.grp -plots plots.pdf

    A pdf should be created. Good luck!

    Potential search terms:

    ERROR: dependencies ‘gplots’, ‘png’ are not available for package ‘gsalib’

    Error: ERROR: no packages specified

    ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info

    Running R externally.

    Post edited by Baskine on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Hi @Baskine, thanks for posting this run-through. FYI though, gsalib is now available in CRAN, so it's no longer necessary to compile GATK from source to get it. Hopefully this will make it easier for our users who are less comfortable with that sort of thing.

    Geraldine Van der Auwera, PhD

  • BaskineBaskine BethesdaPosts: 2Member
    edited October 2013

    You're welcome!

    I did it this way because >install.packages("gsalib") in R wasn't successful for me (I this is what "gsalib is now avalible in CRAN" means?) , even though the package itself looked like it did install successfully. I would get the error message "RScript exited with 1. Run with -l DEBUG for more info" when I ran AnalyzeCovariates.

    Post edited by Baskine on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Yes, that's what I meant. It seems that in some cases installing the library from command line doesn't make it accessible when the Rscript gets run by GATK. It's an R environment setup issue, I think.

    Geraldine Van der Auwera, PhD

  • EugenieEugenie Posts: 13Member

    Sorry for the stupid question but what does -L 20 mean in the example? As I understand the -L argument defines regions... Thank you

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Hi @Eugenie, -L 20 means "run this analysis on the contig called '20'". For humans (the b37 reference build) that is chromosome 20. In the hg19 reference it is called 'chr20', which is admittedly more explicit. If you're using a different organism or reference, the contigs may be named using different conventions.

    Geraldine Van der Auwera, PhD

  • EugenieEugenie Posts: 13Member

    Dear Geraldine, thank you so much for the explanation. And if I run it for exome data if it worth to specify regions (bed file)? Thank you for your help

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Yes, for exome data you should provide the capture regions file. FYI our standard practice is to pad the intervals with +/-50 bp. You can do it by modifying the intervals in your file or you can use the interval padding argument in your command line.

    Geraldine Van der Auwera, PhD

  • EugenieEugenie Posts: 13Member

    Dear Geraldine, thank you very much for such important note. I realize now that running BQSR and realignment without padding was really stupid...But I suppose that padding intervals depend on the read length? However probably 50 bp is enough for my illumina 100bp;)

    I have another issue which is maybe a bit out of topic: I run into the incompatibility between data and known indels contigs. I know, it's very popular and I searched for related threads, but usually it's about the b37/hg19 incompatibility. In my case it's always hg19 but the alignment was done with the reference with only "canonical" contigs: chr1, 2...M. And the known indels from the bundle include patches as I understand. So now I wonder if it's better to edit indels vcf or rerun the alingnment with a patched reference version. And the question is: if you can suggest where to look for some information about using patched reference version vs old one? I just can't realize if it worth it as the realignment will take quite a long. Maybe there were some discussions I failed to find...

    Thank you very much for your support and sorry for all this mess

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Hi @Eugenie,

    It's not stupid, I realize now we haven't really documented that anywhere. I will write some documentation about working with exomes, hopefully that will help. We use 50 bp on Illumina 100 bp reads, so you're good to go with that.

    You do need to make sure you use matching resources and reference. The known indels in our bundle match the reference we provide as well in the bundle. You should be able to check whether it's the same one you use. The extra contigs won't matter if I recall correctly.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Hi @Eugenie,

    It's not stupid, I realize now we haven't really documented that anywhere. I will write some documentation about working with exomes, hopefully that will help. We use 50 bp on Illumina 100 bp reads, so you're good to go with that.

    You do need to make sure you use matching resources and reference. The known indels in our bundle match the reference we provide as well in the bundle. You should be able to check whether it's the same one you use. The extra contigs won't matter if I recall correctly.

    Geraldine Van der Auwera, PhD

  • EugenieEugenie Posts: 13Member

    Hello Geraldine,

    I meant it seems quite resonable once mentioned, but not obvious before) But you already did such a great job, thank you! It's just not so easy to come across related discussions.

    But now I see you already mentioned it in some topics. And some more related questions appeared. You told:

    ... It is indeed better (though not necessary) to use the padded intervals file for base recalibration. For indel realignment it doesn't matter. ...we don't use intervals until the PrintReads step (after Base Recalibration). (Sorry I don't know how to link posts...)

    Please can you help me to clarify it completely: your practice is to run realignment on targeted data (eg exomes) without intervals and include them only in the final step of BQSR? But in case of BQSR it's more because of running time or there can be any other reasons?

    Yes, you were right: only the order and not the number of contigs matters - putting chrM at the end solved the problem, thank you)

    Sorry for lots of dummy questions and thank you for your patience!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    No problem, I'm here to help. For realignment, it doesn't really matter to the final results if you use intervals or not, because at worst you realign some areas outside the targets for nothing. So it's better to use them (at the RealignerTargetCreator step) to save processing time, but if you don't, your results will still be ok. For recalibration it's more important, because if you have some off-target coverage, those areas will typically have lower mapping quality and higher rate of mismatches that do not represent real variation. So if you don't use intervals, they will be included in the recalibration and will mess up the signal a little, which can reduce the quality of recalibration results. Make sense?

    Geraldine Van der Auwera, PhD

  • EugenieEugenie Posts: 13Member

    Yes, definitely;) thanks a lot! My concern was if I miss reads which overlap the target only partially when running realignment with intervals (even with 50bp padding) - I thought maybe that's why you don't use them intentionally at this step...but seems it's not the case? Thank you again for thorough explanation!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    You're welcome, I'm glad it helps. Note that the GATK tools typically use all reads that overlap a target of interest; even if part of the reads are "hanging over the edge", they will still get included in the analysis.

    Geraldine Van der Auwera, PhD

  • EugenieEugenie Posts: 13Member

    Thank you, Geraldine! It was not obvious (for me) and you really reassured me;) But please allow me to ask for another peace of advice. It was mentioned that even for cancer samples the difference from ref. is not so high to spoil the recalibration. But what if the aim is to call somatic mutations in cancer with a very high mutation rate (let say 10/mb)? If there is a risk to underestimate the quality of somatic mutation calls by this way? I see that it's very individual, but still if you used to apply this step in such cases as well? Thank you for your help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Hmm, this is not something we have any experience with. Maybe ask in the MuTect forum what would be their recommendation?

    Geraldine Van der Auwera, PhD

  • vifehevifehe SpainPosts: 10Member

    Hello Geraldine,

    I am running GATK for the first time and I am at the step of analyzing the patterns of covaraition using my own exome data (from humans). I have come with a couple of questions when trying to use the known variants database provided in the bundle (v2.8/b37). There are at least two different versions of dbsn (dbsnp_138.b37.vcf and dbsnp_138.b37.excluding_sites_after_129.vcf) what is the difference between these two? thanks in advance

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Hi @vifehe‌,

    The dbsnp_138.b37.vcf file contains all sites in dbsnp build 138 without modification. The dbsnp_138.b37.excluding_sites_after_129.vcf is based on the first file, but we've removed the sites that were added to dbsnp since the release of build 129. You can use either one as known sites. If I recall correctly the reason we maintain two different versions is because we had certain benchmarks metrics based on the 129 callset and wanted to have that available, but with updated annotations where applicable.

    Geraldine Van der Auwera, PhD

  • IrantzuIrantzu Posts: 2Member

    Hi Geraldine,

    Maybe I'm doing something wrong but... when I apply the recalibration to my input bam (using printReads), the size of the output bam is 34M, when the input is 22M. This is OK? Because I thougth that the size "should be" more or less the same... because we are only changing the value of the quality... It's my first time with GATK so forgive me if I'm asking obvious question.

    Thanks in advance

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,073Administrator, GATK Developer admin

    Hi @Irantzu‌,

    Don't worry, that's expected. The recalibration adds insertion and deletion qualities (like base call quals but for indels) so the file size can increase quite a bit.

    Geraldine Van der Auwera, PhD

  • LindsayJYLindsayJY HomePosts: 1Member

    I am designing a pipeline to detect predominantly novel mutations across 14 genes (i.e. targeted/enriched, not whole exome). In this case, where I'm looking for novel variants, is it recommended that the BQSR step be skipped? I'm concerned that the quality scores for such variants would be decreased.

  • SheilaSheila Broad InstitutePosts: 378Member, GATK Developer, Broadie, Moderator admin

    @LindsayJY

    Hi,

    Yes, you should skip BQSR due to too few bases from only 14 genes.

    -Sheila

  • kamilo889kamilo889 kamilo889Posts: 3Member

    @Baskine‌ thnaks a lot! after install the R packages works properly! I was suffering in this step!

Sign In or Register to comment.