
AnalyzeCovariates fails in a strange way

SvyatoslavSidorov, St. Petersburg, Russia (Member)

Dear GATK team,

I use Queue to build a pipeline from the GATK tools RealignerTargetCreator, IndelRealigner, BaseRecalibrator, AnalyzeCovariates, and HaplotypeCaller, and I wrote the corresponding QScript. The first time I tested it, all functions finished with no errors. But when I then invoked it with -startFromScratch, it failed to execute AnalyzeCovariates, saying:

ERROR 19:08:14,037 FunctionEdge - Error: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv'
ERROR 19:08:14,045 FunctionEdge - Contents of bqsr-report.pdf.out: [...]

In the bqsr-report.pdf.out file I can see no errors:

INFO 17:25:12,764 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:12,767 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:40
INFO 17:25:12,767 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:25:12,767 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:25:12,771 HelpFormatter - Program Args: -T AnalyzeCovariates -L intervals_to_process.interval_list -R Homo_sapiens_assembly38.fasta -before recal-table1.txt -after recal-table2.txt -plots bqsr-report.pdf -csv bqsr-report.csv
INFO 17:25:12,779 HelpFormatter - Executing as [...]
INFO 17:25:12,779 HelpFormatter - Date/Time: 2016/02/23 17:25:12
INFO 17:25:12,780 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:12,780 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:12,838 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:25:13,079 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 17:25:13,196 IntervalUtils - Processing 3088286401 bp from intervals
INFO 17:25:13,277 GenomeAnalysisEngine - Preparing for traversal
INFO 17:25:13,287 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:25:13,287 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:25:13,288 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 17:25:13,289 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 17:25:13,764 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 17:25:13,921 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 17:25:13,931 AnalyzeCovariates - Generating csv file 'bqsr-report.csv'
INFO 17:25:14,164 AnalyzeCovariates - Generating plots file 'bqsr-report.pdf'
INFO 17:25:20,279 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.gatk.tools.walkers.bqsr.AnalyzeCovariates$None@3ce295f9
INFO 17:25:20,282 ProgressMeter - done 0.0 6.0 s 11.6 w 100.0% 6.0 s 0.0 s
INFO 17:25:20,283 ProgressMeter - Total runtime 7.00 secs, 0.12 min, 0.00 hours
INFO 17:25:21,513 GATKRunReport - Uploaded run statistics report to AWS S3

I can see that AnalyzeCovariates and HaplotypeCaller start nearly at the same time:

INFO 17:25:09,264 FunctionEdge - Starting: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv'
INFO 17:25:09,264 FunctionEdge - Output written to bqsr-report.pdf.out
INFO 17:25:21,548 FunctionEdge - Starting: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'HaplotypeCaller' '-I' 'recaled.bam' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-variant_index_type' 'LINEAR' '-variant_index_parameter' '128000' '-o' 'sample.gvcf' '-D' 'dbsnp_144.hg38.vcf' '-ERC' 'GVCF' '-pcrModel' 'CONSERVATIVE'
INFO 17:25:21,548 FunctionEdge - Output written to sample.gvcf.out

and after HaplotypeCaller finishes successfully (I checked that it produced a gVCF file), this error message about AnalyzeCovariates is printed:

INFO 19:08:14,029 QGraph - 0 Pend, 2 Run, 0 Fail, 10 Done
ERROR 19:08:14,037 FunctionEdge - Error: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv'

As a result, AnalyzeCovariates produced neither the CSV nor the PDF report in this run.

When I invoked the QScript again (this time not from scratch) to reproduce the situation, it executed with no errors, and AnalyzeCovariates generated both CSV and PDF reports.

After that, I executed AnalyzeCovariates manually with -l DEBUG, and it produced only the CSV report, no PDF. Here is the central part of the AnalyzeCovariates debug output:

`DEBUG 20:38:28,593 RecalUtils - R command line: Rscript (resource)org/broadinstitute/gatk/engine/recalibration/BQSR.R bqsr-report.csv recal-table1.txt bqsr-report1.pdf
DEBUG 20:38:28,607 RScriptExecutor - Executing:
DEBUG 20:38:28,607 RScriptExecutor - Rscript
DEBUG 20:38:28,607 RScriptExecutor - -e
DEBUG 20:38:28,608 RScriptExecutor - tempLibDir = '/tmp/Rlib.4876869209103817519';source('/tmp/BQSR.3779158431229884689.R');
DEBUG 20:38:28,608 RScriptExecutor - bqsr-report.csv
DEBUG 20:38:28,608 RScriptExecutor - recal-table1.txt
DEBUG 20:38:28,608 RScriptExecutor - bqsr-report1.pdf

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:


Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
DEBUG 20:38:34,790 RScriptExecutor - Result: 0
INFO 20:38:34,792 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.gatk.tools.walkers.bqsr.AnalyzeCovariates$Non
INFO 20:38:34,795 ProgressMeter - done 0.0 7.0 s 11.6 w 100.0% 7.0 s 0.0 s
INFO 20:38:34,795 ProgressMeter - Total runtime 7.03 secs, 0.12 min, 0.00 hours `

So, every time I start my QScript from scratch, AnalyzeCovariates fails, but it finishes successfully when I invoke the script again without -startFromScratch. When I execute AnalyzeCovariates manually with -l DEBUG, it produces only the CSV report.
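
Since the log above ends with "Result: 0" even though no PDF appeared, it may help to check the output files themselves rather than rely on the exit status. The following is a minimal Python sketch of such a check, not part of GATK or Queue; the helper name `missing_outputs` is hypothetical, and the file names are the ones used in this post.

```python
import os

def missing_outputs(paths):
    """Return the paths that do not exist or are empty files."""
    return [p for p in paths
            if not (os.path.isfile(p) and os.path.getsize(p) > 0)]

if __name__ == "__main__":
    # Report names taken from the AnalyzeCovariates command line in this post.
    expected = ["bqsr-report.csv", "bqsr-report.pdf"]
    missing = missing_outputs(expected)
    if missing:
        print("Missing or empty outputs: " + ", ".join(missing))
    else:
        print("All expected outputs are present.")
```

Running it in the working directory after each attempt (from scratch, rerun, manual) shows which of the two reports each attempt actually left behind.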

Should I use the BQSR.R script directly?

I will be very grateful for any tips and help.

Best Answer


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    If I had to guess I would say your script doesn't fully describe the dependencies between tasks. But it's hard to say without seeing the script itself.
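
    To illustrate what an incompletely described dependency can look like, here is a toy Python model, not Queue's actual code. It assumes a scheduler that orders jobs only by matching each job's declared input files against other jobs' declared output files. The job names, the two-pass BaseRecalibrator layout, and the file `realigned.bam` are hypothetical, since the original script was not posted; the other file names come from the logs above.

```python
def build_order(jobs):
    """Return, for each job, the set of jobs that must finish first.

    `jobs` maps a job name to a dict with 'inputs' and 'outputs' (sets of
    file names). A job depends on whichever job declares one of its inputs
    as an output. Toy model only; this is not Queue's implementation.
    """
    producer = {}
    for name, spec in jobs.items():
        for f in spec["outputs"]:
            producer[f] = name
    deps = {name: set() for name in jobs}
    for name, spec in jobs.items():
        for f in spec["inputs"]:
            if f in producer and producer[f] != name:
                deps[name].add(producer[f])
    return deps

def jobs_for(analyze_inputs):
    # Hypothetical pipeline fragment; 'realigned.bam' is an assumed name.
    return {
        "BaseRecalibrator_1": {"inputs": {"realigned.bam"},
                               "outputs": {"recal-table1.txt"}},
        "BaseRecalibrator_2": {"inputs": {"realigned.bam", "recal-table1.txt"},
                               "outputs": {"recal-table2.txt"}},
        "AnalyzeCovariates": {"inputs": set(analyze_inputs),
                              "outputs": {"bqsr-report.pdf", "bqsr-report.csv"}},
    }

# Both recalibration tables declared as inputs: the job waits for both passes.
complete = build_order(jobs_for({"recal-table1.txt", "recal-table2.txt"}))
# Tables not declared as inputs: the model sees no prerequisites at all.
incomplete = build_order(jobs_for(set()))

print(sorted(complete["AnalyzeCovariates"]))
print(sorted(incomplete["AnalyzeCovariates"]))
```

    In this model, a job with undeclared inputs has no prerequisites, so on a run from scratch it can be launched before the tables exist, while a later rerun finds the tables already on disk. That would be consistent with the pattern described in the question, but it is only a guess without the script.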
