
ReduceReads Memory Usage

agout Posts: 3 Member


I've been trying to get ReduceReads working in a pipeline I built that uses GATK tools to call variants in RNA-seq data. After indel realignment and base recalibration, I'm trying to run ReduceReads before calling variants with UnifiedGenotyper.

I've been using GATK version 2.3.9. When I run ReduceReads on a 1.7GB .bam file, I need to set aside 100GB of memory for the process to complete (otherwise I get an error saying I didn't provide enough memory to run the program and should adjust the maximum heap size using the -Xmx option).

The problem isn't that ReduceReads doesn't work. It does, but of the 100GB I set aside, it uses 80-90GB. This means I can't run more than one job at a time given the memory constraints of the machine I'm using.
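As a rough illustration of that constraint, the sketch below estimates how many jobs fit on a machine when each one reserves a fixed heap. The 128 GB machine total and the function name `max_concurrent_jobs` are assumptions for illustration only; the actual machine size isn't stated in this post, and a JVM can use somewhat more than its -Xmx value.

```python
def max_concurrent_jobs(machine_mem_gb, heap_per_job_gb, os_reserve_gb=0.0):
    """Estimate how many jobs fit in memory if each reserves heap_per_job_gb.

    machine_mem_gb  -- total RAM on the machine (hypothetical value below)
    heap_per_job_gb -- heap given to each job, e.g. the number in -Xmx100g
    os_reserve_gb   -- memory held back for the OS and other processes
    """
    if heap_per_job_gb <= 0:
        raise ValueError("heap_per_job_gb must be positive")
    usable_gb = machine_mem_gb - os_reserve_gb
    # Floor division: a job that only partly fits does not count.
    return max(0, int(usable_gb // heap_per_job_gb))


if __name__ == "__main__":
    # Assumed 128 GB machine, not a figure from this thread.
    print(max_concurrent_jobs(128, 100))  # -> 1 (the situation described above)
    print(max_concurrent_jobs(128, 16))   # -> 8 (if each job needed only 16 GB)
```

With a 100GB heap per job, only one job fits unless the machine has at least 200GB of RAM, which is why getting the per-job requirement down matters more than anything else here.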

I've been looking through the GATK forum and understand it may be a GATK version issue, but I've tried GATK 2.5.2 ReduceReads for this step and it still requires 70-80GB of memory.

Can anyone provide any clues as to what I may be doing wrong, or whether I can do something to make it use less memory so I can run multiple jobs simultaneously?

The command I'm using is:

java -Xmx100g -Djava.io.tmpdir=/RAW/javatmp/ -jar /NMC/LCR/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T ReduceReads -R /SCRATCH/LCR/BWAIndex_hg19/genome.fa -I out.bam.sorted.bam.readGroups.bam.rmdup.bam.realigned.bam.recalibrated.bam -o out.bam.sorted.bam.readGroups.bam.rmdup.bam.realigned.bam.recalibrated.bam.reducedReads.bam

Thanks in advance,



  • ebanks Posts: 683 GATK Developer mod

    Hi Alex,

    When you say that ReduceReads works on your RNA-seq data, do you mean that it is actually producing correct results, or just that it's not failing with an error? I am nearly 100% sure that it will not produce correct results with RNA-seq data and shouldn't be used (but am happy to be proven wrong). If you are unsure, I'd recommend not running RR on your data.

    We are planning on producing a best practices recommendations document for RNA-seq processing with the GATK, but unfortunately we don't have anything official just yet.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • agout Posts: 3 Member

    Hi Eric,

    Thanks for this feedback - much appreciated! I've left ReduceReads out of the pipeline.

    It would be great to have a GATK - RNA-seq processing best practices document at hand. Look forward to it.

    Best Regards, Alex
