The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# No plots generated by the BaseRecalibrator walker

Member
edited August 2012

I am using GATK v2 (GenomeAnalysisTK-2.0-0-g4c0ffd4) and was trying out the new BaseRecalibrator walker. According to this post the BaseRecalibrator should output "A PDF file containing quality control plots showing the patterns of recalibration of the data", however I do not have any such file. Both the BaseRecalibrator and PrintReads steps of the BQSR pipeline appear to have worked as I have a recalibrated BAM file and the accompanying GATKReport but I would like to be able to view plots of the recalibration process (and preferably have these generated automatically by the recalibration pipeline).

Thanks

Tagged:

• Dev

The BQSR.R script should be embedded inside the jar that was downloaded from that link. It gets pulled out and called with Rscript from within the BaseRecalibration walker during runtime. In order to call the Rscript directly, like in the command line I posted, you have to download the source from GitHub to get the source of the R script.

• Dev

Hi there,

The most common cause of this issue is not having Rscript in the environment path. Can you run "which Rscript" and see what it says. If that looks fine then there may be a warning message in the log output of the BaseRecalibrator run that might help us to debug.

Cheers,

• Member
edited October 2012

I can confirm that Rscript is in my path. I'll need to re-run BaseRecalibrator in order to get the log as I foolishly didn't save it to file. I'll post that once I have done so.

Cheers

unix303 507 % echo PATH /usr/local/bioinf/bin:/cluster/home/users/lab0605/bioinformatics/bahlolab_db/annovar:/usr/local/bioinf/bin:/usr/local/bioinfsoftware/root/current/bin:/usr/local/bioinfsoftware/polyphen/current/bin/:/usr/local/bioinfsoftware/perl/current/bin/:/usr/local/bioinfsoftware/DiversityEstimates/current/bin:/usr/local/bioinfsoftware/DiversityEstimates/current/Scripts:/usr/local/bioinfsoftware/AmpliconNoise/current/bin:/usr/local/bioinfsoftware/AmpliconNoise/current/Scripts:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/users/lab0605/hickey/../bin/Linux:/home/users/lab0605/hickey/../bin:/usr/local/bioinfsoftware/vaast/current/bin:/usr/local/bioinfsoftware/vaast/current/bin/vaast_tools unix303 506 % which Rscript /usr/local/bioinf/bin/Rscript  Post edited by Geraldine_VdAuwera on • Broad InstituteMember, Broadie, Dev Note that you can just run the tool on a tiny interval (e.g. 1:1-1000) if you want to regenerate the error without waiting for it to rerun over the whole genome. • Member I've rerun BaseRecalibrator with the -l DEBUG' flag. Here is (I think) the relevant output: INFO 08:20:48,388 BaseRecalibrator - Generating recalibration plots... DEBUG 08:20:48,827 RScriptExecutor - Executing: DEBUG 08:20:48,827 RScriptExecutor - Rscript DEBUG 08:20:48,827 RScriptExecutor - -e DEBUG 08:20:48,828 RScriptExecutor - tempLibDir = '/tmp/Rlib.2011861167067641148';source('/tmp/BQSR.1384038019643928030.R'); DEBUG 08:20:48,829 RScriptExecutor - /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv DEBUG 08:20:48,829 RScriptExecutor - /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv.pdf Loading required package: methods Error in continuous_scale(c("y", "ymin", "ymax", "yend", "yintercept", : unused argument(s) (formatter = "comma") Calls: source ... eval -> eval -> scale_y_continuous -> continuous_scale In addition: Warning messages: 1: NAs introduced by coercion 2: NAs introduced by coercion Execution halted DEBUG 08:20:53,553 RScriptExecutor - Result: 1 WARN 08:20:53,553 RScriptExecutor - RScript exited with 1  I can send the full log file if required. Cheers • Dev Hi there, It looks like the data being sent to the plotting script is somehow malformed. Can you post the output of this file: /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv Even the first 20 lines or so would be helpful to debug. Out of curiosity did you happen to run with Eric's trick of -L 1:1-1000 or is this error also from running over a larger region? Thanks! • Member edited August 2012 Hi Ryan, Hmm, the plot thickens. I re-ran BaseRecalibrator and still did not get a pdf plot, however the error message changed. Both attempts to use the BaseRecalibrator were using the entire BAM (i.e. not using Eric's trick). Here is my command line: gatk -T BaseRecalibrator -I realigned_374.bam -R /export/share/disk501/lab0605/Bioinformatics/databases/bahlolab_db/mm10/mm10.fasta --knownSites /export/share/disk501/lab0605/Bioinformatics/databases/bahlolab_db/mm10/annot/20110602-final-snps.vcf.gz -o 374_recal_data.grp -l DEBUG -log BQSR_374.out -k I'm not able to attach the csv file to my post ("Uploaded file type is not allowed"). Instead I've uploaded it to Dropbox https://www.dropbox.com/s/ombr1wgosksiuz2/374_recal_data.grp.csv The full log file is here as well https://www.dropbox.com/s/51l595pqnmy6k6c/BQSR_374.out Also, the -k for BaseRecalibrator is not a documented at the command line, although it is mentioned at http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflows Cheers Post edited by rpoplin on • Dev edited August 2012 The input csv file looks fine and it doesn't look like there were any errors reported this time in your log file: INFO 16:21:57,814 BaseRecalibrator - Generating recalibration plots... DEBUG 16:21:58,348 RScriptExecutor - Executing: DEBUG 16:21:58,349 RScriptExecutor - Rscript DEBUG 16:21:58,349 RScriptExecutor - -e DEBUG 16:21:58,350 RScriptExecutor - tempLibDir = '/tmp/Rlib.1049277187388468345';source('/tmp/BQSR.3375557281675659007.R'); DEBUG 16:21:58,350 RScriptExecutor - /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv DEBUG 16:21:58,351 RScriptExecutor - /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv.pdf DEBUG 16:22:02,286 RScriptExecutor - Result: 1 WARN 16:22:02,287 RScriptExecutor - RScript exited with 1  The pdf file should have showed up at /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv.pdf I think the problem must lie in some difference in the R setup between our systems. What if you execute the Rscript directly with this command: RscriptGATK/public/R/scripts/org/broadinstitute/sting/gatk/walkers/bqsr/BQSR.R /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv.pdf

Can you please post any command line output from that. When I run that command here locally using the csv you posted on dropbox I get the pdf (attached).

Thanks,

• Member

Thanks for the pdf, Ryan.

I think I've figured out the problem. I am using GATK v2.0 (Beta) (v2.0-39) downloaded from http://www.broadinstitute.org/gatk/download and this does not include the BQSR.R script. As far as I can tell, BQSR.R is only included in GATK-lite source that is available via GitHub.

I'm a little confused as to why BQSR.R is only available in the GitHub source and not in the directly downloadable GATK v2.0. Can you please explain whether I am overlooking something or the reason for this?

Thanks

• Dev

The BQSR.R script should be embedded inside the jar that was downloaded from that link. It gets pulled out and called with Rscript from within the BaseRecalibration walker during runtime. In order to call the Rscript directly, like in the command line I posted, you have to download the source from GitHub to get the source of the R script.

• Member
edited August 2012

I've found the problem and it's due to changes in ggplot2 for version >=0.9 and a missing grid package in R. Specifically, in BQSR.R the following is called:

scale_y_continuous(formatter="comma")

This will work for version < 0.9 of ggplot2 but not for versions >= 0.9; the current version of ggplot2 on CRAN is 0.9.1 and is the version I have installed. Full details of changes in ggplot2 for v >= 0.9 can be found at cloud.github.com/downloads/hadley/ggplot2/guide-col.pdf.

The required code for v >= 0.9 of ggplot2 is as follows:

# Preamble
library(scales) # This package is required in order to use comma_format(). It contains some of the functionality previously in ggplot2.
library(grid) # Required to use grid.layout (at least in R v2.15.1)
# ...
# scale_y_continuous(formatter="comma") replaced by scale_y_continuous(labels = comma_format()) in the following line:
d <- p + geom_histogram(aes(fill=Recalibration,weight=Observations),alpha=0.6,binwidth=1,position="identity") + scale_fill_manual(values=c("maroon1","blue")) + facet_grid(.~EventType) + scale_y_continuous(labels = comma_format())


With those alterations made to BQSR.R I can produce the pdf figure manually.

Thanks! I pushed a fix for this -- along with an expanded BQSR.R script -- to unstable, and it'll come along with the upcoming 2.1 release on Monday

• Member

No problem, Mark. One further improvement would be in the labelling of the x-axis tick marks for the "context covariate plots" - the three base suffixes are basically illegible in the plots I've seen.

• Member

@Mark_DePristo said:
Thanks! I pushed a fix for this -- along with an expanded BQSR.R script -- to unstable, and it'll come along with the upcoming 2.1 release on Monday

Hi,

I am using "GenomeAnalysisTK-2.1-12-ga99c19d" version and still not able to produce plots. Below is the command i have used,

"java -jar GATK_dir/GenomeAnalysisTK.jar -T BaseRecalibrator -I duprem.realigned.fixed.bam -RReference -knownSites $dbSNP_vcf/canFam2_snp131.vcf -o recal_data.grp" Any suugestions!!! • Member Hi there, I am also trying to get the pdf with the plots... I am using the version: v2.1-8-g5efb575 I did both step of the BaseRecalibrator and the PrintReads, I get a recalibrated BAM file and the gatkReport, but the "A PDF file containing quality control plots showing the patterns of recalibration of the data" is missing... Can you please help? The commands I used: java -Xmx4g -jar /home/matteo/Desktop/programs/pipelines/gatk/GenomeAnalysisTK-2.1/GenomeAnalysisTK.jar -T BaseRecalibrator -R ncbi_S288c.fasta -I output/$1.realigned.bam -o output/$1.recal_data.grp java -Xmx4g -jar /home/matteo/Desktop/programs/pipelines/gatk/GenomeAnalysisTK-2.1/GenomeAnalysisTK.jar -T BaseRecalibrator -R ncbi_S288c.fasta -I output/$1.realigned.bam -BQSR output/$1.recal_data.grp -o output/$1.grp

java -Xmx4g -jar /home/matteo/Desktop/programs/pipelines/gatk/GenomeAnalysisTK-2.1/GenomeAnalysisTK.jar -T PrintReads -R ncbi_S288c.fasta -I output/1.realigned.bam -BQSR output/1.grp -o output/$1.bam • Member I forgot to say that the BQSR.R is present embedded inside the jar • Cambridge, MAMember, Administrator, Broadie @ivernar and @mehar, did you see any message at all in the stack trace about the Rscript failing, or is it just silently not generating the plots? • Member Hi, I didnt see any message about the Rscript, I will run it with debug option and I'll let you know... • Member sorry, I found it: ScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info. so, I will run it with debug option and I'll let you know... • Member @ivernar - This sounds like the same error I encountered. I'm still unable to produce the BQSR plots automatically despite changes made from v2.0 to v2.1. My fix is to add the -k flag to my call to BaseRecalibrator (to retain the temporary csv file) and then run a modified version of the BQSR.r script over the csv file. I require a modified version of BQSR.r due to differences between my R configuration and the configuration/versions assumed by GATK. Here's an example call: GATK -T BaseRecalibrator -I${SAMPLE_ID}.realigned.bam -R ${REF} -knownSites${SNPS_VCF} -o ${SAMPLE_ID}.recal_data.grp -log${SAMPLE_ID}.BQSR.log Rscript BQSR_modified_by_PH.r ${SAMPLE_ID}_recal_data.grp.csv${SAMPLE_ID}_recal_data.grp.csv.pdf

I can't attach my BQSR_modified_by_PH.r script, however the changes are documented in my earlier comment.

• Member
edited November 2012

HI Geraldine_VdAuwera,

Several packages were unavailable initially and i have installed them manually. The latest error is" there is no package called 'gsalib". I tried to install 'gsalib' but could not find the source for R-2.15.1. Any fix for this?

Here are the few suspicious line from the error report of the BaseRecalibrator component,

perl: warning: Falling back to the standard locale ("C").
gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

gdata: to support 'XLSX' (Excel 2007+) files.

gdata: Run the function 'installXLSXsupport()'
gdata: libaries needed to support Excel XLS and XLSX formats.

Attaching package: 'gdata'

The following object(s) are masked from 'package:stats':

nobs

The following object(s) are masked from 'package:utils':

object.size

Attaching package: 'gplots'

The following object(s) are masked from 'package:stats':

lowess

Attaching package: 'reshape'

The following object(s) are masked from 'package:plyr':

rename, round_any

Error in library(gsalib) : there is no package called 'gsalib'
Calls: source -> withVisible -> eval -> eval -> library
Execution halted.

Post edited by Geraldine_VdAuwera on
• Member
edited November 2012

Geraldine_VdAuwera

ivernar and mehar, did you see any message at all in the stack trace about the Rscript failing, or is it just silently not generating the plots?

HI Geraldine_VdAuwera,

Several packages were unavailable initially and i have installed them manually. The latest error is" there is no package called 'gsalib". I tried to install 'gsalib' but could not find the source for R-2.15.1. Any fix for this?

Here are the few suspicious line from the error report of the BaseRecalibrator component,

perl: warning: Falling back to the standard locale ("C"). gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

gdata: Unable to load perl libaries needed by read.xls() gdata: to support 'XLSX' (Excel 2007+) files.

gdata: Run the function 'installXLSXsupport()' gdata: to automatically download and install the perl gdata: libaries needed to support Excel XLS and XLSX formats.

Attaching package: 'gdata'

The following object(s) are masked from 'package:stats':

nobs
The following object(s) are masked from 'package:utils':

object.size

Attaching package: 'gplots'

The following object(s) are masked from 'package:stats':

lowess

Attaching package: 'reshape'

The following object(s) are masked from 'package:plyr':

rename, round_any
Error in library(gsalib) : there is no package called 'gsalib' Calls: source -> withVisible -> eval -> eval -> library Execution halted.

Post edited by Geraldine_VdAuwera on

We're not sure why some of you are still having trouble getting the rscripts to work automatically. Unfortunately it seems to be something that varies with your R environment config, which I can't help you with.

However I can tell you that if you're trying to call gsalib independently of the GATK, you'll need to download the GATK source from github (see the Downloads page) and get gsalib from there. See this article for more details:

• Member
edited November 2012

Hi!
I am running the GATK v2.2-3 and I get my GATK Report but I am not able to get any pdf file from the BaseRecalibrator walker. I've also rerun BaseRecalibrator with the -l DEBUG' flag.
Here is the output that I think is relevant (are there duplicate row names?!):

Loading required package: methods
gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

Attaching package: 'gdata'

The following object is masked from 'package:stats':

nobs

The following object is masked from 'package:utils':

object.size

Attaching package: 'gplots'

The following object is masked from 'package:stats':

lowess

Attaching package: 'reshape'

The following object is masked from 'package:plyr':

rename, round_any

duplicate 'row.names' are not allowed
Calls: source ... withVisible -> eval -> eval -> read.csv -> read.table
Execution halted


Here is my command line (this is when I rerun BaseRecalibrator to get the pdf with the the original qualities along with the recalibrated qualities):

java -jar GenomeAnalysisTK.jar -I mySample.bam -BQSR mySample_first.grp  -R refHuman/human_g1k_v37.fasta -T BaseRecalibrator -o mySample_second.grp --plot_pdf_file mySample_first_second.grp.pdf  -knownSites:dbSNP,VCF dbsnp_137.b37.vcf -nct 12 -l DEBUG


Do you think is there a problem with my Bam file? When I validate my file using the Picard tools (ValidateSamFile.jar) I got only the following output:

ERROR: Record 44764671, Read name FCD0HJ5ACXX:4:1103:18118:188026#, MAPQ should be 0 for unmapped read.
ERROR: Record 44990092, Read name FCD0HJ5ACXX:4:2204:12662:139015#, MAPQ should be 0 for unmapped read.
ERROR: Record 45463864, Read name FCD0HJ5ACXX:4:1107:8307:101631#, MAPQ should be 0 for unmapped read.


Do you think this is causing the problem? I have no idea what to do to get the pdf files since in the current version I cannot use the -k flag to get the csv file.

Thanks for any help!

Post edited by Geraldine_VdAuwera on

You need to enable the creation of the pdf - it isn't done by default anymore. See the documentation for more details.

• Member

Thanks for the quick answer. It means that I shall add to my command line the flag --emit_original_quals true. Is it correct?

• Member
edited November 2012

But that is already in my command line. I posted it before but I will copy here again:

java -jar GenomeAnalysisTK.jar -I mySample.bam -BQSR mySample_first.grp -R refHuman/human_g1k_v37.fasta -T BaseRecalibrator -o mySample_second.grp --plot_pdf_file mySample_first_second.grp.pdf -knownSites:dbSNP,VCF dbsnp_137.b37.vcf -nct 12 -l DEBUG


Hmm, I missed that, sorry. Your command line looks fine, I guess it's the Rscript causing the problem. Not sure what's going on there. Have you tried loading the .grp file into R with gsalib?

• Member
edited November 2012

Yes, it works when I load the .grp file into R.

> library(gsalib)
> summary(d)
Length Class      Mode
Arguments   2      data.frame list
Quantized   3      data.frame list
RecalTable0 6      data.frame list
RecalTable1 6      data.frame list
RecalTable2 8      data.frame list

Post edited by Geraldine_VdAuwera on
• Member

I do not know if there is any problem with the GATK report because the .bam files I got after I did the quality score recalibration are smaller (I remember that in the GATK 2.1 version the sizes of the same files increased almost to the double). Just to give an example, sample1_realgn.bam (after running the tools RealignerTargetCreator and IndelRealigner) was about 3.3Gb. After I run the tools BaseRecalibrator and PrintReads the new bam file is about 1.5Gb. If I run ReduceReads on the last file, I get a file with about 1.4Gb. Am I doing something wrong?

• Member

Regarding the last post, I rerun the command line for BaseRecalibrator and PrintReads and I got the files with almost the double size as I expected (from 3.3Gb to 5.8Gb after BaseRecalibrator and PrintReads, and then 1.4Gb after ReduceReads for the sample I gave as example). There was a problem with the server I was using but I do not know how (and if) it did affect my jobs... However I still have the same problem with plots. I still not get them but I am still able to load my .grp file in R as I previously mentioned.

Ah, that confirms what I thought -- the bam file size is probably not relevant to your original problem. The debug message indicates that the R script failed, so that's most likely to be it. The question is why did the script fail. I'm consulting the team, we'll try to find the answer.

• Member

Hi,
I am also not able to generate plots with GATK2.2-3. I can load my .grp file into R with gsalib. However, running BaseRecalibrator, I get following error:

INFO  13:39:40,097 BaseRecalibrator - Generating recalibration plots...
DEBUG 13:39:40,161 RScriptExecutor - Executing:
DEBUG 13:39:40,161 RScriptExecutor -   Rscript
DEBUG 13:39:40,161 RScriptExecutor -   -e
DEBUG 13:39:40,162 RScriptExecutor -   tempLibDir = '/scratch/pbs.27501.ax3/tmp/Rlib.7344000831173984327';source('/scratch/pbs.27501.ax3/tmp/BQSR.4899331575441606758.R');
DEBUG 13:39:40,162 RScriptExecutor -   /scratch/pbs.27501.ax3/tmp/BQSR705839790581939999.csv
DEBUG 13:39:40,162 RScriptExecutor -   /ax3-cgi/analysis/1SF.recalibration_report.grp
DEBUG 13:39:40,163 RScriptExecutor -   /ax3-cgi/analysis/1SF.recalibration.pdf
WARNING: ignoring environment value of R_HOME
gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

Attaching package: ‘gdata’

The following object(s) are masked from ‘package:stats’:

nobs

The following object(s) are masked from ‘package:utils’:

object.size

Attaching package: ‘gplots’

The following object(s) are masked from ‘package:stats’:

lowess

Attaching package: ‘reshape’

The following object(s) are masked from ‘package:plyr’:

rename, round_any

more columns than column names
Execution halted
DEBUG 13:39:42,574 RScriptExecutor - Result: 1
WARN  13:39:42,575 RScriptExecutor - RScript exited with 1


Dalia

• Member
edited November 2012

I am getting the exact same error as Dalia, and I've tried both GATK 2.2-3 and 2.2-4. It seems the error is with reading the csv file:

Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
more columns than column names


When I run the csv file manually through the BQSR.R script from the GATK-Lite clone from Github, it works perfectly:

Rscript GATK/BQSR.R BQSR8622570015259924177.csv SC5314.sf.dedup.realigned.grp SC5314.sf.dedup.realigned.csv.pdf  Perhaps there is a difference between the BQSR.R script inside the full version of GATK2 vs GATK-Lite? Note: Since GATK 2.2-3 and 2.2-4 don't allow the -k option for BaseRecalibrator to retain the temporary csv file, I just wrote a script that checks the temp directory every second for a csv file and moves it to a different directory before it gets automatically deleted. Post edited by Geraldine_VdAuwera on • Broad InstituteMember, Broadie, Dev There is a "hidden" argument that allows you to specify a permanent location for the csv: --intermediate_csv_file. We can make it un-hidden in the next release if that's helpful. • Member Thanks for the tip Eric, making that argument visible might be helpful for debugging these sorts of issues. Upon further investigation, it looks like my csv file might be malformed rather than the R script having an issue. I'm having trouble attaching a file to my post, but here are the first few lines of the csv file: ReadGroup,CovariateValue,CovariateName,EventType,Observations,Errors,EmpiricalQuality,AverageReportedQuality,Accuracy,Recalibration 622RY.1,CAC,Context,Base Insertion,13250864.00,857.001559,41.89,45.00,-3.11,ORIGINAL 622RY.1,CAC,Context,Base Deletion,13250864.00,1,796.287276,38.68,45.00,-6.32,ORIGINAL 622RY.1,AA,Context,Base Substitution,148573581.00,424,808.187498,25.44,26.56,-1.12,ORIGINAL 622RY.1,AAA,Context,Base Insertion,51228232.00,8,346.105600,37.88,45.00,-7.12,ORIGINAL 622RY.1,AAA,Context,Base Deletion,51228232.00,25,254.920299,33.07,45.00,-11.93,ORIGINAL  There are 10 items in the header, 10 in the first row but 11 in the second row. Looking through the rest of the file, it looks like most rows have 11 items. For the rows that have 10 items, it looks like the column that is currently called "Errors" is missing. These are just my observations; I have never looked at these covariate csv files before so I'm not sure exactly what they are supposed to look like. Could there possibly be a bug in the code that generates these csv files? • Broad InstituteMember, Broadie, Dev Excellent detective work! You have uncovered a major bug in the latest release of the code. What is happening is that for the Errors column the number is printed out with commas when great than 999 (which is bad in a comma-delimited file). I will push in a fix for this ASAP and it should be available in a few hours. • Member Has this Issue of creating a plot from BaseRecalibrator been solved?? • Cambridge, MAMember, Administrator, Broadie @shahr2, several different issues have been addressed and resolved in this thread. If you are having trouble getting plots generated, see if any of these apply to you and if so, apply the corresponding solution (a common one is upgrading to the latest version of GATK). If your problem is not resolved here, feel free to post a new question, with details (error messages if applicable, etc). However, be advised that in many cases the problem is the user's R configuration, which is not something we can provide support for. • Member It appears to be that a new version of ggplot2 is released: 0.9.3 After I read this thread I managed to get he plots generated. However, in the debug it states the following: 'opts' is deprecated. Use 'theme' instead. (Deprecated; last used in version 0.9.1) 'theme_blank' is deprecated. Use 'element_blank' instead. (Deprecated; last used in version 0.9.1) 'theme_blank' is deprecated. Use 'element_blank' instead. (Deprecated; last used in version 0.9.1) 'theme_blank' is deprecated. Use 'element_blank' instead. (Deprecated; last used in version 0.9.1) 'theme_blank' is deprecated. Use 'element_blank' instead. (Deprecated; last used in version 0.9.1) 'opts' is deprecated. Use 'theme' instead. (Deprecated; last used in version 0.9.1) theme_text is deprecated. Use 'element_text' instead. (Deprecated; last used in version 0.9.1) 'opts' is deprecated. Use 'theme' instead. (Deprecated; last used in version 0.9.1) theme_text is deprecated. Use 'element_text' instead. (Deprecated; last used in version 0.9.1) 'opts' is deprecated. Use 'theme' instead. (Deprecated; last used in version 0.9.1) theme_text is deprecated. Use 'element_text' instead. (Deprecated; last used in version 0.9.1) 'opts' is deprecated. Use 'theme' instead. (Deprecated; last used in version 0.9.1) theme_text is deprecated. Use 'element_text' instead. (Deprecated; last used in version 0.9.1) Warning messages: 1: NAs introduced by coercion 2: NAs introduced by coercion  Is this an issue in the perfomance how the plots are generated? does this case any differences? As i cannot compare it to any previous versions of ggplot2. Thanks in advance. Niels • Cambridge, MAMember, Administrator, Broadie This should not affect the generation of the plots. • Member edited January 2013 I have encountered the same problem and solved the issue by installing the missing R packages indicated by running -l DEBUG flag. To summary, in order to generate pdf files, the following R packages have to be installed: • bitops • caTools • colorspace • gdata • ggplot2 • gplots • gtools • RColorBrewer • reshape Hope this can help. • Cambridge, MAMember, Administrator, Broadie Thanks for sharing your experience, @silentio • Reykjavik, IcelandMember Thank you for the above information. I've installed the above packages (within R by '> install.packages(c("bitops"))' etc) but I can't figure out how to get the pdf. I'm running GATK v2.3-9-ge5ebf34 (not from queue). Do I get the following steps right? I first run the BaseRecalibrator to get a grp-output (paths simplified here): java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I myfile.realigned.bam -R path/to/hg19.fa -knownSites path/to/dbsnp_137.hg19.vcf -knownSites path/to/Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites path/to/1000G_phase1.indels.hg19.vcf -o myfile.recal_data.grp And then I run this command again with the -BQSR part (and a different -o filename): java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I myfile.realigned.bam -R path/to/hg19.fa -knownSites path/to/dbsnp_137.hg19.vcf -knownSites path/to/Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites path/to/1000G_phase1.indels.hg19.vcf -BQSR myfile.recal_data.grp -o myfile2.recal_data.grp The second step only adds the second grp file, almost identical to the first one. There seems to be no problem with the PrintReads step afterwards, it's just that it would be nice also to have the plots. Could this have something to do with how I installed the GATK? I downloaded the v2.3-9 (not Lite) and just mouse clicked the bz2-file. Should I have used some ant command(s) ? • Cambridge, MAMember, Administrator, Broadie Yes, you need to get the source download from the repository, build gsalib (do ant gsalib) (it's an R library we use to read in gatkreports, aka .grp files) and install it into your R environment. Sorry if this is unclear from the docs; we're trying to think of a better way to provide this library in future. • Reykjavik, IcelandMember Thanks Geraldine. I'm sorry to load you with more questions but I'm not too familiar with computer science language so could you tell me if I understood your instructions correctly? I did the following: I visited the Downloads page. According to your explanations on http://gatkforums.broadinstitute.org/discussion/1244 there should've been further instructions on the Downloads page but I didn't find them so I guessed how to continue. The Downloads page has a link (under GATK Lite) to Github, opening the Commit History page. I found that if I mouse-clicked "Tags" close to the upper right corner, it opened a list of code links. I selected the top "Source code (tar.gz)" (version 2.3) and moved the download to my GATK folder, where I double-clicked it and continued with the unix commands: cd path/to/GATK/gatk-2.3 ant gsalib ant It seemed to install normally, only warning that the walker FindCoveredIntervals was currently undocumented. Does this mean that I've installed the latest full version GATK (not GATK Lite)? The java ... -h command doesn't reveal a version like "v2.3-9-ge5ebf34" but only writes "vexported" instead. I'm indeed uncertain if I did this right. Should I have used more ant commands? Long ago when I ran earlier versions I could download and install updates within my path/to/GATK/destination_directory by using the unix commands "git pull; ant clean; ant". It seems I can't do this any longer. Should I have included "ant clean" when I now installed 2.3? Or more / other ant commands? Please tell if you see something I should have done simpler or more correctly. I fear that I even selected a completely wrong code link path from the Downloads page. With great appreciation for your tremendous work with developing gatk, making it available and helping inexperienced people like me. Thanks! • Cambridge, MAMember, Administrator, Broadie You're on the right track. Those commands don't really "install" anything, they just compile some program files. If you didn't overwrite the GenomeAnalysisTK.jar file that you already had, then you should be able to invoke that version just as you did before. What you still need to do is add gsalib to your R environment. To do that, go to the directory where you did ant gsalib, start R and do install.packages("gsalib"). After that you'll be able to load the gsalib library manually if you want to process gatkreports yourself, and the GATK will be able to call the necessary rscripts to generate plots of base recalibration. If you want to be able to update the GATK using git pull etc. when we release a new version, you need to "clone the repository" -- there should be a button to do that on the github page. Right now this is only possible for GATK Lite, but starting next week when we release version 2.4, you'll be able to do it for the full GATK (as long as you use GATK for non-commercial work). • Reykjavik, IcelandMember Great! Although I didn't have luck with install.packages("gsalib"). It always directs to a Cran mirror and then says "package ‘gsalib’ is not available (for R version 2.15.2)". I tried from within the directory you suggested (path/to/GATK/gatk-2.3, where I had done ant gsalib), and also from an R subdirectory further down (path/to/GATK/gatk-2.3/public/R/src/org/broadinstitute/sting/utils/R/gsalib/R - did that make sense?). I also tried library(gsalib) in line with your explanations on http://gatkforums.broadinstitute.org/discussion/1244 and R accepted that from within the R-subdirectory (and also from elsewhere if I made a .Rprofile reference to that subdirectory). I'm still stuck with the gsalib installation but I guess this is a problem with my knowledge of R rather than gatk and I can't really expect you to spend time helping me with that. Just if you see a very simple solution ... But I look forward to trying the full 2.4 next week. And it's good to know I can move forward (with or without the gsalib) using the GenomeAnalysisTK.jar file I had before. Thanks again for your explanation and help! • Boston, MAMember To install the R gsalib library, you can do: sudo R CMD INSTALL --preclean public/R/src/org/broadinstitute/sting/utils/R/gsalib  this is the equivalent of what ant gsalib does. That takes care of installation into R libraries. You can confirm by doing a library(gsalib) from an interactive R session. Hope this helps. • Reykjavik, IcelandMember Yes it helped! Now R can run library(gsalib) from any directory. Thanks! • Reykjavik, IcelandMember A little addition on my attempt to use the gsalib: I seem to be able to view .grp files from within R (after >library(gsalib)) since if I copy the example from http://gatkforums.broadinstitute.org/discussion/1244 into a "test.grp" file and do > d = gsa.read.gatkreport('test.grp') and > summary(d) it perfectly displays the example table. (Just to mention: the filename had to be shown within ' ' rather than ""). But the v2.3-9-ge5ebf34 BaseRecalibrator still makes a .grp file without pdf-tables. Maybe I misunderstood how it should work (or it's confined to the Sting/Lite)? And if I try (as with the test.grp file) to look at my BaseRecalibrator .grp file in R, it results in a 'colnames' error' (...'names' attribute [6] must be the same length as the vector [1]). Well, perhaps that's an inappropriate twist to the discussions in this thread, and maybe it's also irrelevant since 2.4 is on its way soon. One more question however: Should I try adding BQSR.R, cf. the answer by rpoplin at the top of this page? If so, is there a particular path I should make and copy it to (related to my location of GenomeAnalysisTK.jar after opening the .bz2 download)? • Cambridge, MAMember, Administrator, Broadie I just checked the command lines you posted earlier; it looks like you're not specifying plot filenames. If you don't specify them the tool doesn't generate the plots. See the technical documentation to find which arguments control the plots. • NewcastleMember edited January 2013 Just to get back on this last post by adalgeir I've downloaded the latest version 2.3-9 I have all the R dependencies installed including gsalib. I too also see the same error traced with -l DEBUG: Error in 'colnames<-'('*tmp*', value = c("ReadGroup", "EventType", "EmpiricalQuality", : 'names' attribute [6] must be the same length as the vector [1] Calls: source ... finishTable -> .gsa.assignGATKTableToEnvironment -> colnames<- Execution halted  I'm running with my own data here: java -Xmx28g -jar ~/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -nct 9 -l DEBUG -I lane_1_sorted.bam -L chr1:1-2000 -R /data/GATK_bundle/hg19/ucsc.hg19.fasta -knownSites dbsnp_137.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites 1000G_phase1.indels.hg19.vcf -o TEST_Lane_1_recal_data.grp --plot_pdf_file TEST_lane_1_recal.pdf --intermediate_csv_file TEST_lane_1.cvs  Is there a bug in the R code? Or have I done something wrong? • Cambridge, MAMember, Administrator, Broadie This is an error that pops up occasionally for some people and we're not sure why. Can you make sure that you have the latest version of gsalib installed? That doesn't get updated when you update the GATK itself. • NewcastleMember edited January 2013 The version of gsalib I'm using is one I built using ant from the GATK source on github just this week i.e. it should be current. • Cambridge, MAMember, Administrator, Broadie Matt, is this occurring when the script gets run by BaseRecalibrator, or when you try to run it yourself? • Reykjavik, IcelandMember Thanks again Geraldine and apologies for not noticing the need of -plot specification. Now I've reached to the same stage as Matt. When I run BaseRecalibrator (the full version download of GenomeAnalysisTK.jar), it's exactly like Matt describes (the same error when run with -l DEBUG). I'll wait and see how this thread continues (I have to leave my office for a while). Thanks, both of you. • NewcastleMember edited January 2013 It's occurring when BaseRecalibrator is calling Rscipt. • NewcastleMember edited February 2013 Right so removing -L chr1:1-2000 looks to have fixed my issue, the plots have been plotted by BaseRecalibrator calling Rscript, I wonder if there was insufficient data to produce them form such a small run. Post edited by MattB on • Cambridge, MAMember, Administrator, Broadie Hi Matt, That's right, if you are running on a very small amount of data you end up with some empty tables in the gatkreport and so the script fails when it tries to process them. Sorry I didn't notice the -L in your command line. • Member Hello I had the same problems posted before, and decided to conglomerate very good tips posted by all the users above, in order to solve problems while dealing with the .pdf files generation, sorry if I forget mentioning someone. First I downloaded the Source Code: wget 'https://github.com/broadgsa/gatk/archive/master.zip' Then unzip master.zip Enter the gatk-master folder, execute cd /path/to/gatk-master ant gsalib  I wish you've got a "BUILD SUCCESSFUL" after that. Now let's enter the R environment, in order to install some key packages, as @silentio mentioned: R >install.packages("bitops") >install.packages("caTools") >install.packages("colorspace") >install.packages("gtools") >install.packages("gplots") >install.packages("gdata") >install.packages("ggplot2") >install.packages("RColorBrewer") >install.packages("reshape")  now let's take a look at BaseRecalibrator: java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R genome.fa -I file.bam -knownSites sites.vcf -o recal.grp -intermediate recal.grp.csv the -intermediate  option is an important part of the process, as noticed by @ebanks. With this intermediate .csv file, we can call the Rscript to generate the plots, as @dkvitek did: Rscript path/to/gatk-master/public/R/scripts/org/broadinstitute/sting/utils/recalibration/BQSR.R recal.grp.csv recal.grp recal.grp.pdf after some logs, you should have you .pdf file. You can take a look at your .pdf file by executing evince recal.grp.pdf Run the post recalibration providing -BQSR and once again run the Rscript Done! Maybe some dependent R packages will be needed, but R will ask you to install them. Thanks! • Cambridge, MAMember, Administrator, Broadie @murilogborges, thanks for putting this together! As a caveat, please note that once you've configured your R environment as described, the GATK should be able to run the Rscripts for you, so you shouldn't need to run them manually nor keep intermediate files. However for those users who cannot seem to get that to work, this method should be a workable substitute. • Member Hi Geraldine, I'm getting similar errors and have tracked it down to the ggplot library. Before I wake up my cluster administrator (I don't have rights to install R libs), can you please confirm that this is still a problem with v 2.4-9. If yes, do you have pointers as to which libraries and which R version needs to be installed? Many thanks in advance! • Cambridge, MAMember, Administrator, Broadie Hi Klaus, This hasn't changed, so you'll need to wake up your admin, sorry! The R version shouldn't be very important -- fyi I have 2.13.1 on my machine. You'll need to install ggplot2 and its dependencies, which can be obtained from CRAN in the usual way, and our own library, gsalib, which at the moment is distributed with the source code on Github. You'll need to download the source and build GATK (ant clean dist), and the gsalib will be in one of the resource directories created by the build. I realize this is rather inconvenient but we are planning to put gsalib in CRAN fairly soon to make it all easier. • Member Hi I have used the following commandline for Baserecalibration in my cluster java -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator --default_platform illumina -I Sample_realigned.bam
-R ./reference/genome.fa -knownSites snps_indels.vcf -knownSites indels.vcf -o Sample_rcal.grp
-plots Sample_rcal.pdf -S LENIENT

after analysis I have the Sample_rcal.grp file but no plots Sample_rcal.pdf is generated
In my cluster R enviroment there is no gsalib, gpplot2,gplots libraries so, I installed all required pakages on my local windows R environment and I can read the GATK report as follows-

>library(gsalib)

> summary(d)

Length Class      Mode
Arguments   2      data.frame list
Quantized   3      data.frame list
RecalTable0 6      data.frame list
RecalTable1 6      data.frame list
RecalTable2 8      data.frame list


But I need to generate the plot file , How can I do this on my own machine ?

Any suggestion

Thanks

Ashutsoh

edited April 2013

If you want to run the script that generates the plots on your own machine, you will need to add the -intermediate argument to your BaseRecalibrator command in order to save the intermediate .csv file (otherwise it is deleted as a temp file). Then, copy that file to your local machine. You will need also to get the GATK source from Github because you need the BQSR.R script. Once you have that, you adapt this command with the appropriate filenames and paths, and run it to generate the plots:

Rscript \$GATK/public/R/scripts/org/broadinstitute/sting/gatk/walkers/bqsr/BQSR.R intermediate_file.csv output_plot.pdf

• Member
edited June 2013

I am also not able to generate the bqsr plot. Here is the relevant debug output

DEBUG 16:33:38,952 QualQuantizer - updated intervals: [QQ:0-9, QQ:10-12, QQ:13-14, QQ:15-15, QQ:16-16, QQ:17-17, QQ:18-18, QQ:19-19, QQ:20-20, QQ:21-21, QQ:22-22, QQ:23-29, QQ:30-30, QQ:31-31, QQ:32-32, QQ:33-93]
INFO  16:33:38,953 BaseRecalibrator - Writing recalibration report...
INFO  16:33:41,182 BaseRecalibrator - ...done!
INFO  16:33:41,183 BaseRecalibrator - Generating recalibration plots...
DEBUG 16:33:41,183 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 5, 1012, 3]
DEBUG 16:33:41,183 NestedIntegerArray - Pre-allocating first 2 dimensions
DEBUG 16:33:41,183 NestedIntegerArray - Done pre-allocating first 2 dimensions
DEBUG 16:33:41,262 RScriptExecutor - Executing:
DEBUG 16:33:41,262 RScriptExecutor -   Rscript
DEBUG 16:33:41,262 RScriptExecutor -   -e
DEBUG 16:33:41,263 RScriptExecutor -   tempLibDir = '/tmp/Rlib.2327386667045653425';source('/tmp/BQSR.6943530625249500071.R');
DEBUG 16:33:41,263 RScriptExecutor -   /tmp/BQSR5693155957981819144.csv
DEBUG 16:33:41,263 RScriptExecutor -   /share/data/skhan/genome1_25_soyb/HN001/test_bqsr.grp
DEBUG 16:33:41,263 RScriptExecutor -   /share/data/skhan/genome1_25_soyb/HN001/test_bqsr.pdf
DEBUG 16:33:44,144 RScriptExecutor - Result: 1
WARN  16:33:44,144 RScriptExecutor - RScript exited with 1
INFO  16:33:44,145 BaseRecalibrator - Processed: 3913 reads
INFO  16:33:44,148 ProgressMeter -            done        3.92e+03    7.0 s       30.2 m     99.3%         7.0 s     0.0 s
INFO  16:33:44,148 ProgressMeter - Total runtime 7.10 secs, 0.12 min, 0.00 hours
INFO  16:33:44,149 MicroScheduler - 1643 reads were filtered out during traversal out of 5561 total (29.55%)
INFO  16:33:44,149 MicroScheduler -   -> 425 reads (7.64% of total) failing DuplicateReadFilter
INFO  16:33:44,149 MicroScheduler -   -> 1203 reads (21.63% of total) failing MappingQualityZeroFilter
INFO  16:33:44,150 MicroScheduler -   -> 15 reads (0.27% of total) failing NotPrimaryAlignmentFilter
DEBUG 16:33:44,153 GATKRunReport - Aggregating data for run report
DEBUG 16:33:44,613 GATKRunReport - Posting report of type STANDARD
DEBUG 16:33:44,615 GATKRunReport - Generating GATK report to AWS S3 with key TzWc7LoRH2GTCLSCKnfs3YFwXgS3edx2.report.xml.gz
INFO  16:33:45,263 GATKRunReport - Uploaded run statistics report to AWS S3
DEBUG 16:33:45,264 GATKRunReport - Uploaded to AWS: S3Object [key=TzWc7LoRH2GTCLSCKnfs3YFwXgS3edx2.report.xml.gz, bucket=<Unknown>, lastModified=Mon Jun 24 16:33:46 CDT 2013, dataInputStream=null, Metadata={ETag="ad272e580f30fbce0c3629f
b6974b17f", Date=Mon Jun 24 16:33:46 CDT 2013, Content-Length=411, id-2=g5TacGPGNiZCpywuXaQZEwuUZoorzRV1pakmfJRuZBRPmfMdf4WlJ12tmnkdAfKi, request-id=80E37BAB3D7499B4, Content-MD5=rScuWA8w+84MNin7aXSxfw==, Content-Type=application/octet-
stream}]

Post edited by Geraldine_VdAuwera on

Hi @smk_84,

I'm not sure why the plotting didn't work for you here. Did you generate plots successfully previously? And what version are you using?

• Member

I did actually but it was on another server.
I am using version 2.5-2

Hmm. Your local environment may not be correctly set up on the new server. You could try running again with the option to keep the intermediate csv file, and run the BQSR.R script (which you can get from the source code repository on github) directly on the csv file. If it doesn't work it should at least yield more informative error message.

• Member

Why there's no RMSE value in generated plots?...@Geraldine

Hi @beezone,

• Member

Hello @ Geraldine, I am able to generate BQSR plot successfully using BaseRecalibrator and got some really nice recalibration plots such plots for Reported quality vs Empirical quality score. But what I am looking for is RMSE value of such plot, which is not shown/present in plot that I have generated. Please, have a look at this thread (http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr), I hope this will clarify about RMSE value that I am talking about.

Oh, I see what you mean now. The publicly available R script that is used for plotting as part of the standard pipeline doesn't include calculation of the RMSE, sorry. The plots with RMSE value that are shown in that thread are produced with internal scripts. I'll see if we can make those public but I can't promise anything.

• Member

Hello Geraldine. Sorry for the strange issue. I don't think it worths a separate discussion and it's quite minor, but maybe somehow related to this topic. I am also trying to get plots but what I am curious about: why the error which occurs while Rscript is running is not reported in the log file (logging level : INFO). So when checking run on the way I see:

...
INFO 14:55:54,431 AnalyzeCovariates - Generating csv file 'sample1.BQSR_test.csv'
INFO 14:55:54,869 AnalyzeCovariates - Generating plots file 'sample1.BQSR_plots_test.pdf'
INFO 14:55:56,688 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

But the log file ends with just:

...
INFO 14:55:54,431 AnalyzeCovariates - Generating csv file 'sample1.BQSR_test.csv'
INFO 14:55:54,869 AnalyzeCovariates - Generating plots file 'sample1.BQSR_plots_test.pdf'
INFO 14:55:56,688 GATKRunReport - Uploaded run statistics report to AWS S3

Do you think there is a way to get a full report in log file? Otherwise it looks a bit confusing when checking afterwards...
Maybe it's something special about running Rscript and not GATK walker?

Thank you!

@Eugenie‌

Hi,

You can try setting the logging level to ERROR to see if that helps.

-Sheila

• Member

Thank you, Sheila!

• Sun Yat-Sen UniversityMember

java -jar /data3/xxw/00-GATK/GenomeAnalysisTK.jar -T AnalyzeCovariates -R /data3/xxw/00-GATK/ucsc.hg19.fasta -before 1A.recal.06-1.grp -after 1A.recal.06-2.grp -csv BQSR.csv.
i can't get the pdf picture

@xxw Please have a look at the documentation-- there is an article that addresses this problem.

• Sun Yat-Sen UniversityMember

sorry, can you show me the link of article? Thank you very much.

• Sun Yat-Sen UniversityMember

Rscript /data3/xxw/00-GATK/BQSR.R BQSR.csv BQSR.pdf

Attaching package: 'gplots'

The following object is masked from 'package:stats':

lowess
`

Error in file(filename, "r", blocking = TRUE) :
cannot open the connection
In file(filename, "r", blocking = TRUE) :
cannot open file 'BQSR.pdf': No such file or directory
Execution halted