We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
No plots generated by the BaseRecalibrator walker

I am using GATK v2 (GenomeAnalysisTK-2.0-0-g4c0ffd4) and was trying out the new BaseRecalibrator
walker. According to this post the BaseRecalibrator
should output "A PDF file containing quality control plots showing the patterns of recalibration of the data", however I do not have any such file. Both the BaseRecalibrator
and PrintReads
steps of the BQSR pipeline appear to have worked as I have a recalibrated BAM file and the accompanying GATKReport
but I would like to be able to view plots of the recalibration process (and preferably have these generated automatically by the recalibration pipeline).
Can you please help?
Thanks
Best Answer
-
rpoplin ✭✭✭
The BQSR.R script should be embedded inside the jar that was downloaded from that link. It gets pulled out and called with Rscript from within the BaseRecalibration walker during runtime. In order to call the Rscript directly, like in the command line I posted, you have to download the source from GitHub to get the source of the R script.
Answers
Hi there,
The most common cause of this issue is not having Rscript in the environment path. Can you run "which Rscript" and see what it says. If that looks fine then there may be a warning message in the log output of the BaseRecalibrator run that might help us to debug.
Cheers,
I can confirm that Rscript is in my path. I'll need to re-run BaseRecalibrator in order to get the log as I foolishly didn't save it to file. I'll post that once I have done so.
Cheers
Note that you can just run the tool on a tiny interval (e.g. 1:1-1000) if you want to regenerate the error without waiting for it to rerun over the whole genome.
I've rerun BaseRecalibrator with the `-l DEBUG' flag. Here is (I think) the relevant output:
INFO 08:20:48,388 BaseRecalibrator - Generating recalibration plots... DEBUG 08:20:48,827 RScriptExecutor - Executing: DEBUG 08:20:48,827 RScriptExecutor - Rscript DEBUG 08:20:48,827 RScriptExecutor - -e DEBUG 08:20:48,828 RScriptExecutor - tempLibDir = '/tmp/Rlib.2011861167067641148';source('/tmp/BQSR.1384038019643928030.R'); DEBUG 08:20:48,829 RScriptExecutor - /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv DEBUG 08:20:48,829 RScriptExecutor - /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv.pdf Loading required package: methods Error in continuous_scale(c("y", "ymin", "ymax", "yend", "yintercept", : unused argument(s) (formatter = "comma") Calls: source ... eval -> eval -> scale_y_continuous -> continuous_scale In addition: Warning messages: 1: NAs introduced by coercion 2: NAs introduced by coercion Execution halted DEBUG 08:20:53,553 RScriptExecutor - Result: 1 WARN 08:20:53,553 RScriptExecutor - RScript exited with 1
I can send the full log file if required.
Cheers
Hi there,
It looks like the data being sent to the plotting script is somehow malformed. Can you post the output of this file:
/export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv
Even the first 20 lines or so would be helpful to debug.
Out of curiosity did you happen to run with Eric's trick of -L 1:1-1000 or is this error also from running over a larger region?
Thanks!
Hi Ryan,
Hmm, the plot thickens. I re-ran BaseRecalibrator and still did not get a pdf plot, however the error message changed. Both attempts to use the BaseRecalibrator were using the entire BAM (i.e. not using Eric's trick).
Here is my command line:
gatk -T BaseRecalibrator -I realigned_374.bam -R /export/share/disk501/lab0605/Bioinformatics/databases/bahlolab_db/mm10/mm10.fasta --knownSites /export/share/disk501/lab0605/Bioinformatics/databases/bahlolab_db/mm10/annot/20110602-final-snps.vcf.gz -o 374_recal_data.grp -l DEBUG -log BQSR_374.out -k
I'm not able to attach the csv file to my post ("Uploaded file type is not allowed"). Instead I've uploaded it to Dropbox https://www.dropbox.com/s/ombr1wgosksiuz2/374_recal_data.grp.csv
The full log file is here as well https://www.dropbox.com/s/51l595pqnmy6k6c/BQSR_374.out
Also, the
-k
forBaseRecalibrator
is not a documented at the command line, although it is mentioned at http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflowsCheers
The input csv file looks fine and it doesn't look like there were any errors reported this time in your log file:
INFO 16:21:57,814 BaseRecalibrator - Generating recalibration plots... DEBUG 16:21:58,348 RScriptExecutor - Executing: DEBUG 16:21:58,349 RScriptExecutor - Rscript DEBUG 16:21:58,349 RScriptExecutor - -e DEBUG 16:21:58,350 RScriptExecutor - tempLibDir = '/tmp/Rlib.1049277187388468345';source('/tmp/BQSR.3375557281675659007.R'); DEBUG 16:21:58,350 RScriptExecutor - /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv DEBUG 16:21:58,351 RScriptExecutor - /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv.pdf DEBUG 16:22:02,286 RScriptExecutor - Result: 1 WARN 16:22:02,287 RScriptExecutor - RScript exited with 1
The pdf file should have showed up at
/export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv.pdf
I think the problem must lie in some difference in the R setup between our systems. What if you execute the Rscript directly with this command:
Rscript $GATK/public/R/scripts/org/broadinstitute/sting/gatk/walkers/bqsr/BQSR.R /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv /export/share/disk501/lab0605/hickey/Philippe_Bouillet/IGL00607-613/BAM/374_recal_data.grp.csv.pdf
Can you please post any command line output from that. When I run that command here locally using the csv you posted on dropbox I get the pdf (attached).
Thanks,
Thanks for the pdf, Ryan.
I think I've figured out the problem. I am using GATK v2.0 (Beta) (v2.0-39) downloaded from http://www.broadinstitute.org/gatk/download and this does not include the
BQSR.R
script. As far as I can tell,BQSR.R
is only included in GATK-lite source that is available via GitHub.I'm a little confused as to why
BQSR.R
is only available in the GitHub source and not in the directly downloadable GATK v2.0. Can you please explain whether I am overlooking something or the reason for this?Thanks
The BQSR.R script should be embedded inside the jar that was downloaded from that link. It gets pulled out and called with Rscript from within the BaseRecalibration walker during runtime. In order to call the Rscript directly, like in the command line I posted, you have to download the source from GitHub to get the source of the R script.
I've found the problem and it's due to changes in ggplot2 for version >=0.9 and a missing
grid
package in R. Specifically, inBQSR.R
the following is called:scale_y_continuous(formatter="comma")
This will work for version < 0.9 of ggplot2 but not for versions >= 0.9; the current version of ggplot2 on CRAN is 0.9.1 and is the version I have installed. Full details of changes in ggplot2 for v >= 0.9 can be found at cloud.github.com/downloads/hadley/ggplot2/guide-col.pdf.
The required code for v >= 0.9 of ggplot2 is as follows:
With those alterations made to
BQSR.R
I can produce the pdf figure manually.Thanks! I pushed a fix for this -- along with an expanded BQSR.R script -- to unstable, and it'll come along with the upcoming 2.1 release on Monday
No problem, Mark. One further improvement would be in the labelling of the x-axis tick marks for the "context covariate plots" - the three base suffixes are basically illegible in the plots I've seen.
Hi,
I am using "GenomeAnalysisTK-2.1-12-ga99c19d" version and still not able to produce plots. Below is the command i have used,
"java -jar $GATK_dir/GenomeAnalysisTK.jar -T BaseRecalibrator -I duprem.realigned.fixed.bam -R $Reference -knownSites $dbSNP_vcf/canFam2_snp131.vcf -o recal_data.grp"
Any suugestions!!!
Hi there,
I am also trying to get the pdf with the plots...
I am using the version: v2.1-8-g5efb575
I did both step of the BaseRecalibrator and the PrintReads, I get a recalibrated BAM file and the gatkReport, but the "A PDF file containing quality control plots showing the patterns of recalibration of the data" is missing...
Can you please help?
The commands I used:
java -Xmx4g -jar /home/matteo/Desktop/programs/pipelines/gatk/GenomeAnalysisTK-2.1/GenomeAnalysisTK.jar -T BaseRecalibrator -R ncbi_S288c.fasta -I output/$1.realigned.bam -o output/$1.recal_data.grp
java -Xmx4g -jar /home/matteo/Desktop/programs/pipelines/gatk/GenomeAnalysisTK-2.1/GenomeAnalysisTK.jar -T BaseRecalibrator -R ncbi_S288c.fasta -I output/$1.realigned.bam -BQSR output/$1.recal_data.grp -o output/$1.grp
java -Xmx4g -jar /home/matteo/Desktop/programs/pipelines/gatk/GenomeAnalysisTK-2.1/GenomeAnalysisTK.jar -T PrintReads -R ncbi_S288c.fasta -I output/$1.realigned.bam -BQSR output/$1.grp -o output/$1.bam
I forgot to say that the BQSR.R is present embedded inside the jar
@ivernar and @mehar, did you see any message at all in the stack trace about the Rscript failing, or is it just silently not generating the plots?
Hi,
I didnt see any message about the Rscript,
I will run it with debug option and I'll let you know...
sorry, I found it: ScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info.
so, I will run it with debug option and I'll let you know...
@ivernar - This sounds like the same error I encountered. I'm still unable to produce the BQSR plots automatically despite changes made from v2.0 to v2.1. My fix is to add the
-k
flag to my call to BaseRecalibrator (to retain the temporary csv file) and then run a modified version of theBQSR.r
script over the csv file. I require a modified version ofBQSR.r
due to differences between my R configuration and the configuration/versions assumed by GATK.Here's an example call:
GATK -T BaseRecalibrator -I ${SAMPLE_ID}.realigned.bam -R ${REF} -knownSites ${SNPS_VCF} -o ${SAMPLE_ID}.recal_data.grp -log ${SAMPLE_ID}.BQSR.log Rscript BQSR_modified_by_PH.r ${SAMPLE_ID}_recal_data.grp.csv ${SAMPLE_ID}_recal_data.grp.csv.pdf
I can't attach my
BQSR_modified_by_PH.r
script, however the changes are documented in my earlier comment.HI Geraldine_VdAuwera,
Several packages were unavailable initially and i have installed them manually. The latest error is" there is no package called 'gsalib". I tried to install 'gsalib' but could not find the source for R-2.15.1. Any fix for this?
Here are the few suspicious line from the error report of the BaseRecalibrator component,
HI Geraldine_VdAuwera,
Several packages were unavailable initially and i have installed them manually. The latest error is" there is no package called 'gsalib". I tried to install 'gsalib' but could not find the source for R-2.15.1. Any fix for this?
Here are the few suspicious line from the error report of the BaseRecalibrator component,
We're not sure why some of you are still having trouble getting the rscripts to work automatically. Unfortunately it seems to be something that varies with your R environment config, which I can't help you with.
However I can tell you that if you're trying to call gsalib independently of the GATK, you'll need to download the GATK source from github (see the Downloads page) and get gsalib from there. See this article for more details:
http://gatkforums.broadinstitute.org/discussion/1244
Hi!
I am running the GATK v2.2-3 and I get my GATK Report but I am not able to get any pdf file from the BaseRecalibrator walker. I've also rerun BaseRecalibrator with the `-l DEBUG' flag.
Here is the output that I think is relevant (are there duplicate row names?!):
Here is my command line (this is when I rerun BaseRecalibrator to get the pdf with the the original qualities along with the recalibrated qualities):
Do you think is there a problem with my Bam file? When I validate my file using the Picard tools (ValidateSamFile.jar) I got only the following output:
Do you think this is causing the problem? I have no idea what to do to get the pdf files since in the current version I cannot use the -k flag to get the csv file.
Thanks for any help!
You need to enable the creation of the pdf - it isn't done by default anymore. See the documentation for more details.
Thanks for the quick answer. It means that I shall add to my command line the flag
--emit_original_quals true
. Is it correct?No, you need to specify the filename for the plot. See here:
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html#--plot_pdf_file
But that is already in my command line. I posted it before but I will copy here again:
Hmm, I missed that, sorry. Your command line looks fine, I guess it's the Rscript causing the problem. Not sure what's going on there. Have you tried loading the
.grp
file into R with gsalib?Yes, it works when I load the
.grp
file into R.I do not know if there is any problem with the GATK report because the
.bam
files I got after I did the quality score recalibration are smaller (I remember that in the GATK 2.1 version the sizes of the same files increased almost to the double). Just to give an example,sample1_realgn.bam
(after running the toolsRealignerTargetCreator
andIndelRealigner
) was about 3.3Gb. After I run the toolsBaseRecalibrator
andPrintReads
the new bam file is about 1.5Gb. If I runReduceReads
on the last file, I get a file with about 1.4Gb. Am I doing something wrong?Regarding the last post, I rerun the command line for BaseRecalibrator and PrintReads and I got the files with almost the double size as I expected (from 3.3Gb to 5.8Gb after BaseRecalibrator and PrintReads, and then 1.4Gb after ReduceReads for the sample I gave as example). There was a problem with the server I was using but I do not know how (and if) it did affect my jobs... However I still have the same problem with plots. I still not get them but I am still able to load my .grp file in R as I previously mentioned.
Ah, that confirms what I thought -- the bam file size is probably not relevant to your original problem. The debug message indicates that the R script failed, so that's most likely to be it. The question is why did the script fail. I'm consulting the team, we'll try to find the answer.
Hi,
I am also not able to generate plots with GATK2.2-3. I can load my .grp file into R with gsalib. However, running BaseRecalibrator, I get following error:
Many thanks for the advice!
Dalia
I am getting the exact same error as Dalia, and I've tried both GATK 2.2-3 and 2.2-4. It seems the error is with reading the csv file:
When I run the csv file manually through the BQSR.R script from the GATK-Lite clone from Github, it works perfectly:
Perhaps there is a difference between the BQSR.R script inside the full version of GATK2 vs GATK-Lite?
Note: Since GATK 2.2-3 and 2.2-4 don't allow the -k option for BaseRecalibrator to retain the temporary csv file, I just wrote a script that checks the temp directory every second for a csv file and moves it to a different directory before it gets automatically deleted.
There is a "hidden" argument that allows you to specify a permanent location for the csv: --intermediate_csv_file. We can make it un-hidden in the next release if that's helpful.
Thanks for the tip Eric, making that argument visible might be helpful for debugging these sorts of issues. Upon further investigation, it looks like my csv file might be malformed rather than the R script having an issue. I'm having trouble attaching a file to my post, but here are the first few lines of the csv file:
There are 10 items in the header, 10 in the first row but 11 in the second row. Looking through the rest of the file, it looks like most rows have 11 items. For the rows that have 10 items, it looks like the column that is currently called "Errors" is missing. These are just my observations; I have never looked at these covariate csv files before so I'm not sure exactly what they are supposed to look like. Could there possibly be a bug in the code that generates these csv files?
Excellent detective work! You have uncovered a major bug in the latest release of the code. What is happening is that for the Errors column the number is printed out with commas when great than 999 (which is bad in a comma-delimited file).
I will push in a fix for this ASAP and it should be available in a few hours.
Has this Issue of creating a plot from BaseRecalibrator been solved??
@shahr2, several different issues have been addressed and resolved in this thread. If you are having trouble getting plots generated, see if any of these apply to you and if so, apply the corresponding solution (a common one is upgrading to the latest version of GATK). If your problem is not resolved here, feel free to post a new question, with details (error messages if applicable, etc). However, be advised that in many cases the problem is the user's R configuration, which is not something we can provide support for.
It appears to be that a new version of ggplot2 is released: 0.9.3
After I read this thread I managed to get he plots generated.
However, in the debug it states the following:
Is this an issue in the perfomance how the plots are generated?
does this case any differences?
As i cannot compare it to any previous versions of ggplot2.
Thanks in advance.
Niels
This should not affect the generation of the plots.
I have encountered the same problem and solved the issue by installing the missing R packages indicated by running -l DEBUG flag. To summary, in order to generate pdf files, the following R packages have to be installed:
Hope this can help.
Thanks for sharing your experience, @silentio
Thank you for the above information. I've installed the above packages (within R by '> install.packages(c("bitops"))' etc) but I can't figure out how to get the pdf. I'm running GATK v2.3-9-ge5ebf34 (not from queue). Do I get the following steps right?
I first run the BaseRecalibrator to get a grp-output (paths simplified here):
java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I myfile.realigned.bam -R path/to/hg19.fa -knownSites path/to/dbsnp_137.hg19.vcf -knownSites path/to/Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites path/to/1000G_phase1.indels.hg19.vcf -o myfile.recal_data.grp
And then I run this command again with the -BQSR part (and a different -o filename):
java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I myfile.realigned.bam -R path/to/hg19.fa -knownSites path/to/dbsnp_137.hg19.vcf -knownSites path/to/Mills_and_1000G_gold_standard.indels.hg19.vcf -knownSites path/to/1000G_phase1.indels.hg19.vcf -BQSR myfile.recal_data.grp -o myfile2.recal_data.grp
The second step only adds the second grp file, almost identical to the first one.
There seems to be no problem with the PrintReads step afterwards, it's just that it would be nice also to have the plots.
Could this have something to do with how I installed the GATK? I downloaded the v2.3-9 (not Lite) and just mouse clicked the bz2-file. Should I have used some ant command(s) ?
Yes, you need to get the source download from the repository, build gsalib (do
ant gsalib
) (it's an R library we use to read in gatkreports, aka .grp files) and install it into your R environment. Sorry if this is unclear from the docs; we're trying to think of a better way to provide this library in future.Thanks Geraldine. I'm sorry to load you with more questions but I'm not too familiar with computer science language so could you tell me if I understood your instructions correctly? I did the following:
I visited the Downloads page. According to your explanations on http://gatkforums.broadinstitute.org/discussion/1244 there should've been further instructions on the Downloads page but I didn't find them so I guessed how to continue.
The Downloads page has a link (under GATK Lite) to Github, opening the Commit History page. I found that if I mouse-clicked "Tags" close to the upper right corner, it opened a list of code links.
I selected the top "Source code (tar.gz)" (version 2.3) and moved the download to my GATK folder, where I double-clicked it and continued with the unix commands:
cd path/to/GATK/gatk-2.3
ant gsalib
ant
It seemed to install normally, only warning that the walker FindCoveredIntervals was currently undocumented.
Does this mean that I've installed the latest full version GATK (not GATK Lite)? The java ... -h command doesn't reveal a version like "v2.3-9-ge5ebf34" but only writes "vexported" instead.
I'm indeed uncertain if I did this right. Should I have used more ant commands? Long ago when I ran earlier versions I could download and install updates within my path/to/GATK/destination_directory by using the unix commands "git pull; ant clean; ant". It seems I can't do this any longer. Should I have included "ant clean" when I now installed 2.3? Or more / other ant commands?
Please tell if you see something I should have done simpler or more correctly. I fear that I even selected a completely wrong code link path from the Downloads page.
With great appreciation for your tremendous work with developing gatk, making it available and helping inexperienced people like me. Thanks!
You're on the right track.
Those commands don't really "install" anything, they just compile some program files. If you didn't overwrite the GenomeAnalysisTK.jar file that you already had, then you should be able to invoke that version just as you did before.
What you still need to do is add gsalib to your R environment. To do that, go to the directory where you did
ant gsalib
, start R and doinstall.packages("gsalib")
. After that you'll be able to load the gsalib library manually if you want to process gatkreports yourself, and the GATK will be able to call the necessary rscripts to generate plots of base recalibration.If you want to be able to update the GATK using
git pull
etc. when we release a new version, you need to "clone the repository" -- there should be a button to do that on the github page. Right now this is only possible for GATK Lite, but starting next week when we release version 2.4, you'll be able to do it for the full GATK (as long as you use GATK for non-commercial work).Great! Although I didn't have luck with install.packages("gsalib"). It always directs to a Cran mirror and then says "package ‘gsalib’ is not available (for R version 2.15.2)". I tried from within the directory you suggested (path/to/GATK/gatk-2.3, where I had done ant gsalib), and also from an R subdirectory further down (path/to/GATK/gatk-2.3/public/R/src/org/broadinstitute/sting/utils/R/gsalib/R - did that make sense?). I also tried library(gsalib) in line with your explanations on http://gatkforums.broadinstitute.org/discussion/1244 and R accepted that from within the R-subdirectory (and also from elsewhere if I made a .Rprofile reference to that subdirectory). I'm still stuck with the gsalib installation but I guess this is a problem with my knowledge of R rather than gatk and I can't really expect you to spend time helping me with that. Just if you see a very simple solution ...
But I look forward to trying the full 2.4 next week. And it's good to know I can move forward (with or without the gsalib) using the GenomeAnalysisTK.jar file I had before. Thanks again for your explanation and help!
To install the R gsalib library, you can do:
this is the equivalent of what ant gsalib does. That takes care of installation into R libraries. You can confirm by doing a
library(gsalib)
from an interactive R session. Hope this helps.Yes it helped! Now R can run library(gsalib) from any directory. Thanks!
A little addition on my attempt to use the gsalib: I seem to be able to view .grp files from within R (after >library(gsalib)) since if I copy the example from http://gatkforums.broadinstitute.org/discussion/1244 into a "test.grp" file and do > d = gsa.read.gatkreport('test.grp') and > summary(d) it perfectly displays the example table. (Just to mention: the filename had to be shown within ' ' rather than "").
But the v2.3-9-ge5ebf34 BaseRecalibrator still makes a .grp file without pdf-tables. Maybe I misunderstood how it should work (or it's confined to the Sting/Lite)? And if I try (as with the test.grp file) to look at my BaseRecalibrator .grp file in R, it results in a 'colnames' error' (...'names' attribute [6] must be the same length as the vector [1]). Well, perhaps that's an inappropriate twist to the discussions in this thread, and maybe it's also irrelevant since 2.4 is on its way soon.
One more question however: Should I try adding BQSR.R, cf. the answer by rpoplin at the top of this page? If so, is there a particular path I should make and copy it to (related to my location of GenomeAnalysisTK.jar after opening the .bz2 download)?
I just checked the command lines you posted earlier; it looks like you're not specifying plot filenames. If you don't specify them the tool doesn't generate the plots. See the technical documentation to find which arguments control the plots.
Just to get back on this last post by adalgeir I've downloaded the latest version 2.3-9 I have all the R dependencies installed including gsalib. I too also see the same error traced with -l DEBUG:
I'm running with my own data here:
Is there a bug in the R code? Or have I done something wrong?
This is an error that pops up occasionally for some people and we're not sure why. Can you make sure that you have the latest version of gsalib installed? That doesn't get updated when you update the GATK itself.
The version of gsalib I'm using is one I built using ant from the GATK source on github just this week i.e. it should be current.
Matt, is this occurring when the script gets run by BaseRecalibrator, or when you try to run it yourself?
Thanks again Geraldine and apologies for not noticing the need of -plot specification. Now I've reached to the same stage as Matt. When I run BaseRecalibrator (the full version download of GenomeAnalysisTK.jar), it's exactly like Matt describes (the same error when run with -l DEBUG). I'll wait and see how this thread continues (I have to leave my office for a while). Thanks, both of you.
It's occurring when BaseRecalibrator is calling Rscipt.
Right so removing
-L chr1:1-2000
looks to have fixed my issue, the plots have been plotted by BaseRecalibrator calling Rscript, I wonder if there was insufficient data to produce them form such a small run.Hi Matt,
That's right, if you are running on a very small amount of data you end up with some empty tables in the gatkreport and so the script fails when it tries to process them. Sorry I didn't notice the -L in your command line.
Hello
I had the same problems posted before, and decided to conglomerate very good tips posted by all the users above, in order to solve problems while dealing with the .pdf files generation, sorry if I forget mentioning someone.
First I downloaded the Source Code:
wget 'https://github.com/broadgsa/gatk/archive/master.zip'
Then
unzip master.zip
Enter the gatk-master folder, execute
I wish you've got a "BUILD SUCCESSFUL" after that.
Now let's enter the R environment, in order to install some key packages, as @silentio mentioned:
now let's take a look at
BaseRecalibrator
:java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R genome.fa -I file.bam -knownSites sites.vcf -o recal.grp -intermediate recal.grp.csv
the
-intermediate
option is an important part of the process, as noticed by @ebanks. With this intermediate .csv file, we can call the Rscript to generate the plots, as @dkvitek did:Rscript path/to/gatk-master/public/R/scripts/org/broadinstitute/sting/utils/recalibration/BQSR.R recal.grp.csv recal.grp recal.grp.pdf
after some logs, you should have you
.pdf
file. You can take a look at your .pdf file by executingevince recal.grp.pdf
Run the post recalibration providing
-BQSR
and once again run theRscript
Done! Maybe some dependent R packages will be needed, but R will ask you to install them.
Thanks!
@murilogborges, thanks for putting this together!
As a caveat, please note that once you've configured your R environment as described, the GATK should be able to run the Rscripts for you, so you shouldn't need to run them manually nor keep intermediate files. However for those users who cannot seem to get that to work, this method should be a workable substitute.
Hi Geraldine,
I'm getting similar errors and have tracked it down to the ggplot library. Before I wake up my cluster administrator (I don't have rights to install R libs), can you please confirm that this is still a problem with v 2.4-9.
If yes, do you have pointers as to which libraries and which R version needs to be installed?
Many thanks in advance!
Hi Klaus,
This hasn't changed, so you'll need to wake up your admin, sorry! The R version shouldn't be very important -- fyi I have 2.13.1 on my machine. You'll need to install ggplot2 and its dependencies, which can be obtained from CRAN in the usual way, and our own library, gsalib, which at the moment is distributed with the source code on Github. You'll need to download the source and build GATK (
ant clean dist
), and the gsalib will be in one of the resource directories created by the build. I realize this is rather inconvenient but we are planning to put gsalib in CRAN fairly soon to make it all easier.Hi
I have used the following commandline for Baserecalibration in my cluster
$ java -Xmx30g -jar GenomeAnalysisTK.jar -T BaseRecalibrator --default_platform illumina -I Sample_realigned.bam
-R ./reference/genome.fa -knownSites snps_indels.vcf -knownSites indels.vcf -o Sample_rcal.grp
-plots Sample_rcal.pdf -S LENIENT
after analysis I have the Sample_rcal.grp file but no plots Sample_rcal.pdf is generated
In my cluster R enviroment there is no gsalib, gpplot2,gplots libraries so, I installed all required pakages on my local windows R environment and I can read the GATK report as follows-
But I need to generate the plot file , How can I do this on my own machine ?
Any suggestion
Thanks
Ashutsoh
If you want to run the script that generates the plots on your own machine, you will need to add the
-intermediate
argument to your BaseRecalibrator command in order to save the intermediate .csv file (otherwise it is deleted as a temp file). Then, copy that file to your local machine. You will need also to get the GATK source from Github because you need the BQSR.R script. Once you have that, you adapt this command with the appropriate filenames and paths, and run it to generate the plots:I am also not able to generate the bqsr plot. Here is the relevant debug output
Hi @smk_84,
I'm not sure why the plotting didn't work for you here. Did you generate plots successfully previously? And what version are you using?
I did actually but it was on another server.
I am using version 2.5-2
Hmm. Your local environment may not be correctly set up on the new server. You could try running again with the option to keep the intermediate csv file, and run the BQSR.R script (which you can get from the source code repository on github) directly on the csv file. If it doesn't work it should at least yield more informative error message.
Why there's no RMSE value in generated plots?...@Geraldine
Hi @beezone,
I don't understand your question, can you please clarify?
Hello @ Geraldine, I am able to generate BQSR plot successfully using BaseRecalibrator and got some really nice recalibration plots such plots for Reported quality vs Empirical quality score. But what I am looking for is RMSE value of such plot, which is not shown/present in plot that I have generated. Please, have a look at this thread (http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr), I hope this will clarify about RMSE value that I am talking about.
Oh, I see what you mean now. The publicly available R script that is used for plotting as part of the standard pipeline doesn't include calculation of the RMSE, sorry. The plots with RMSE value that are shown in that thread are produced with internal scripts. I'll see if we can make those public but I can't promise anything.
Hello Geraldine. Sorry for the strange issue. I don't think it worths a separate discussion and it's quite minor, but maybe somehow related to this topic. I am also trying to get plots but what I am curious about: why the error which occurs while Rscript is running is not reported in the log file (logging level : INFO). So when checking run on the way I see:
...
INFO 14:55:54,431 AnalyzeCovariates - Generating csv file 'sample1.BQSR_test.csv'
INFO 14:55:54,869 AnalyzeCovariates - Generating plots file 'sample1.BQSR_plots_test.pdf'
INFO 14:55:56,688 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
org.broadinstitute.sting.utils.R.RScriptExecutorException: RScript exited with 1. Run with -l DEBUG for more info.
at org.broadinstitute.sting.utils.R.RScriptExecutor.exec(RScriptExecutor.java:174)
at org.broadinstitute.sting.utils.recalibration.RecalUtils.generatePlots(RecalUtils.java:548)
at org.broadinstitute.sting.gatk.walkers.bqsr.AnalyzeCovariates.generatePlots(AnalyzeCovariates.java:380)
at org.broadinstitute.sting.gatk.walkers.bqsr.AnalyzeCovariates.initialize(AnalyzeCovariates.java:394)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info.
ERROR ------------------------------------------------------------------------------------------
But the log file ends with just:
...
INFO 14:55:54,431 AnalyzeCovariates - Generating csv file 'sample1.BQSR_test.csv'
INFO 14:55:54,869 AnalyzeCovariates - Generating plots file 'sample1.BQSR_plots_test.pdf'
INFO 14:55:56,688 GATKRunReport - Uploaded run statistics report to AWS S3
Do you think there is a way to get a full report in log file? Otherwise it looks a bit confusing when checking afterwards...
Maybe it's something special about running Rscript and not GATK walker?
Thank you!
@Eugenie
Hi,
You can try setting the logging level to ERROR to see if that helps.
-Sheila
Thank you, Sheila!
java -jar /data3/xxw/00-GATK/GenomeAnalysisTK.jar -T AnalyzeCovariates -R /data3/xxw/00-GATK/ucsc.hg19.fasta -before 1A.recal.06-1.grp -after 1A.recal.06-2.grp -csv BQSR.csv.
i can't get the pdf picture
@xxw Please have a look at the documentation-- there is an article that addresses this problem.
sorry, can you show me the link of article? Thank you very much.
Rscript /data3/xxw/00-GATK/BQSR.R BQSR.csv BQSR.pdf
Loading required package: methods
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
Attaching package: 'gplots'
The following object is masked from 'package:stats':
Error in file(filename, "r", blocking = TRUE) :
cannot open the connection
Calls: gsa.read.gatkreport -> file
In addition: Warning message:
In file(filename, "r", blocking = TRUE) :
cannot open file 'BQSR.pdf': No such file or directory
Execution halted
@xxw
This suggests that the filename is incorrect. Make sure to include the full path to the file (e.g. if the file is in a subdirectory, you have to specify it).
I have tried very hard to find the BQSR.R script in the Github, but can not find it. Is BQSR.R script deleted?
@cailei
Hi,
Here it is: https://github.com/broadgsa/gatk/blob/6ba57d05eb20101517d8888f999d6d1f564d2aeb/public/gatk-tools-public/src/main/resources/org/broadinstitute/gatk/utils/recalibration/BQSR.R
https://www.broadinstitute.org/gatk/guide/article?id=4294
-Sheila