ReduceReads : Memory unstable problem

zhuwenjuan, china (Member)
edited September 2012 in Ask the GATK team

Hi,
I use ReduceReads to reduce my whole-genome BAMs.
For some BAMs, setting java -Xmx4g works very well.
For others, I must set java -Xmx15g or more; otherwise GATK complains:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.1-3-g8892c10):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
ERROR ------------------------------------------------------------------------------------------

A record from a BAM that fails:
AAA:3#0 1187 chr1 10001 0 100M = 10002 101 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTAAC 98838833833333883393:8888388333393:888939:99:989:838823333:8:88:99:9;9;96:9:6;:::869:############### X0:i:361 X1:i:34 XA:Z:chr1,+10001,100M,1;chr1,+10007,100M,1;chr2,+243152478,100M,1;chr5,+10001,100M,1;chr5,+10001,100M,1;chr5,+10001,100M,1;chr5,+10001,100M,1;chr5,+10001,100M,1;chr5,+10001,100M,1;chr5,+10001,100M,1;chr5(many more such entries) MD:Z:94C5 RG:Z:AAA_1.fq.gz XG:i:0 NM:i:1 XM:i:1 XO:i:0 MQ:i:0 OQ:Z:[email protected]=;DDCDEDBEGEF/GGGG:[email protected];D?D>D5C=B5>CCBE5DB###############

A record from a BAM that works well looks normal:
AAA:3#0 163 chrM 1 60 100M = 88 187 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTTTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG [email protected][email protected]@BCCCB X0:i:1 X1:i:0 MD:Z:63C36 RG:Z:AAA_L2 XG:i:0 AM:i:37 NM:i:1 SM:i:37 XM:i:1 XO:i:0 MQ:i:60 OQ:Z:GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGGGGBGFGEGEEEEEFFGGGGGGGGGFGGFEDECFGGDGGGGCGFFFGGGEEBGBFGGGF XT:A:U

What is causing this? Perhaps it is the XA:Z tag, because the first BAM has very long XA strings and the second does not.
Can you fix ReduceReads to handle both kinds of BAM properly?

Best Answer

  • Mark_DePristo, Broad Institute (admin)
    Accepted Answer

    This is an unfortunate algorithmic issue in GATK 2.1 that we intend to address in 2.2. Right now the GATK downsampler -- the algorithm that reduces excessively deep pileups from, say, 100,000x to 250x -- only works for LocusWalkers like the UnifiedGenotyper, not for ReadWalkers like ReduceReads. Because of this, the memory needs of RR are unstable and depend on the depth characteristics of the BAM file. We currently hit this same issue ourselves, and we literally run RR first with 4g, then 8g, then 16g, automatically increasing the memory whenever ReduceReads fails with the memory error. In 2.2 we have refactored the downsampling algorithm to run before reads are passed to the GATK engine, so it will be available to ReadWalkers like ReduceReads. With that, we anticipate that ReduceReads will run 5x faster than it currently does and need only 2g (or 4g) of memory.
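
For anyone who wants to automate the 4g, 8g, 16g retry described above, here is a minimal Python sketch. It is not the Broad's actual script. The function names, the placeholder file names, and the choice to detect the failure by searching the output for the memory error text quoted in this thread are all assumptions made for illustration.

```python
import subprocess

# Fragment of the GATK error message quoted earlier in this thread.
MEMORY_ERROR_TEXT = "did not provide enough memory"


def run_command(cmd):
    """Run a command and return (exit_code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.returncode, proc.stdout


def run_with_escalating_heap(build_cmd, heap_sizes=("4g", "8g", "16g"),
                             runner=run_command):
    """Try each heap size in turn and return the first one that succeeds.

    build_cmd is a hypothetical callback that maps a heap size such as "4g"
    to an argument list, e.g. ["java", "-Xmx4g", "-jar", ...].
    """
    for heap in heap_sizes:
        code, output = runner(build_cmd(heap))
        if code == 0:
            return heap
        if MEMORY_ERROR_TEXT not in output:
            # A different kind of failure: more memory will not help.
            raise RuntimeError(f"non-memory failure at -Xmx{heap}:\n{output}")
    raise MemoryError(f"still out of memory at -Xmx{heap_sizes[-1]}")


def reduce_reads_cmd(heap, gatk_jar="GenomeAnalysisTK.jar", ref="ref.fasta",
                     in_bam="in.bam", out_bam="out.reduced.bam"):
    """Build a ReduceReads command line; every file name is a placeholder."""
    return ["java", f"-Xmx{heap}", "-jar", gatk_jar, "-T", "ReduceReads",
            "-R", ref, "-I", in_bam, "-o", out_bam]
```

Calling `run_with_escalating_heap(reduce_reads_cmd)` would launch the real command; passing a different `runner` makes it possible to dry-run the logic without GATK installed.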

Answers

  • I encountered this issue too. Thanks Mark, that's very helpful. I don't know very much about the Java Virtual Machine, so I have a technical question I wonder if anyone can answer: if my node actually has 8GB of physical memory, is -Xmx8g the highest value I can/should set for this argument? Or is there a good reason to leave it a bit lower at, say, -Xmx7g? Or can I go even higher, such as -Xmx16g, and just let the operating system handle paging?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hi Eric,
    This is a question best asked in a separate post, in order to not mix multiple topics within a single thread.
    We would also prefer if you could please repost it in the "Ask the Community" section, since it's not really a GATK-specific question. Thanks!

  • jgarbe (Member)

    I'm still encountering this issue with GATK 2.3-9. I have a 50GB bam file containing 5 samples and I've been doubling the memory trying to get ReduceReads to run to completion. With -Xmx15g it ran out of memory at 20% completion, now trying 30g.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hi @jgarbe, that shouldn't happen with this version... We'll take a closer look if you can upload a bug report as described here:

    http://www.broadinstitute.org/gatk/guide/article?id=1894

  • sorrywm (Member)

    I am also having this issue with GATK 2.3.9. My bam file is 44 GB and has 1 sample. It ran out of memory at 22.5% completion when given -Xmx16g. Is there a log of open bugs so that I can just note that I have the same issue as jgarbe?

  • ebanks, Broad Institute (Member, Broadie, Dev) ✭✭✭✭

    Have you tried running with down-sampling enabled? Something like -dcov 40 should probably do the trick.
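
To illustrate what capping coverage means in principle, here is a small reservoir-sampling sketch in Python. This is only a generic illustration of keeping at most N items from a stream of unknown length in bounded memory. It is not GATK's downsampler, which works per sample and per position, and the function name and parameters are made up for this example.

```python
import random


def downsample(reads, target, rng=None):
    """Keep at most `target` items from an iterable (reservoir sampling).

    Every item has the same chance of being kept, and memory use is
    bounded by `target` no matter how many items are seen.
    """
    rng = rng or random.Random()
    kept = []
    for i, read in enumerate(reads):
        if i < target:
            # Fill the reservoir with the first `target` items.
            kept.append(read)
        else:
            # Replace a random slot with probability target / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < target:
                kept[j] = read
    return kept
```

With `target=40` this mirrors the spirit of `-dcov 40`: a pileup with 100,000 reads is cut down to 40, while a pileup with 10 reads is returned unchanged.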

  • dklevebring (Member)

    I'm still seeing this with v2.4-3-g2a7af43. Here's my java command:
    java -Xmx4g -jar $GATK -T ReduceReads -I 310N/310N.prmdup.realign.recal.bam -o 310N/310N.prmdup.realign.recal.reduced.bam -R /home/Crisp/hg19/hg19.kary.fasta -dcov 40

    Output:

      INFO  19:39:00,197 HelpFormatter - -------------------------------------------------------------------------------- 
      INFO  19:39:00,199 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.4-3-g2a7af43, Compiled 2013/02/27 12:18:19 
      INFO  19:39:00,199 HelpFormatter - Copyright (c) 2010 The Broad Institute 
      INFO  19:39:00,199 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
      INFO  19:39:00,203 HelpFormatter - Program Args: -T ReduceReads -I 310N/310N.prmdup.realign.recal.bam -o 310N/310N.prmdup.realign.recal.reduced.bam -R /home/Crisp/hg19/hg19.kary.fasta -dcov 40 
      INFO  19:39:00,203 HelpFormatter - Date/Time: 2013/03/12 19:39:00 
      INFO  19:39:00,203 HelpFormatter - -------------------------------------------------------------------------------- 
      INFO  19:39:00,203 HelpFormatter - -------------------------------------------------------------------------------- 
      INFO  19:39:00,319 GenomeAnalysisEngine - Strictness is SILENT 
      INFO  19:39:00,366 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 40 
      INFO  19:39:00,372 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
      INFO  19:39:00,401 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 
      INFO  19:39:00,471 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files 
      INFO  19:39:00,475 GenomeAnalysisEngine - Done creating shard strategy 
      INFO  19:39:00,475 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
      INFO  19:39:00,475 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining 
      INFO  19:39:00,484 ReadShardBalancer$1 - Loading BAM index data for next contig 
      INFO  19:39:00,485 ReadShardBalancer$1 - Done loading BAM index data for next contig 
      INFO  19:39:30,527 ProgressMeter -    chr1:6446116        6.00e+05   30.0 s       50.0 s      0.2%         4.0 h     4.0 h 
      INFO  19:40:03,237 ProgressMeter -   chr1:12986315        1.20e+06   62.0 s       52.0 s      0.4%         4.1 h     4.1 h 
    

    ** snip **

      INFO  22:43:40,074 ProgressMeter -  chr4:128685531        5.18e+07    3.1 h        3.6 m     26.5%        11.6 h     8.6 h 
      INFO  22:44:12,294 ProgressMeter -  chr4:128685604        5.18e+07    3.1 h        3.6 m     26.5%        11.7 h     8.6 h 
      ##### ERROR ------------------------------------------------------------------------------------------
      INFO  22:44:51,776 ProgressMeter -  chr4:128685604        5.18e+07    3.1 h        3.6 m     26.5%        11.7 h     8.6 h 
      ##### ERROR A USER ERROR has occurred (version 2.4-3-g2a7af43): 
      ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
      ##### ERROR Please do not post this error to the GATK forum
      ##### ERROR
      ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
      ##### ERROR Visit our website and forum for extensive documentation and answers to 
      ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
      ##### ERROR
      ##### ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program.  See the -Xmx JVM argument to adjust the maximum heap size provided to Java
      ##### ERROR ------------------------------------------------------------------------------------------
    

    The input file is a regular exome, ≈20Gb in size. Nothing excessive there. I've tried 6 different exomes with the same result, each failing after a different number of reads. I have also created a snippet file covering 10 kb upstream and downstream of that position, and that works fine, so it's not that the reads are broken.

    I should also say that it does work fine when I supply --intervals exome.interval_list.
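
For anyone trying to reproduce the snippet test above, here is a tiny Python helper that builds a `contig:start-stop` interval string around a position. The function name and the default flank are assumptions for illustration, and the helper does not check the window against the contig length.

```python
def window_interval(contig, pos, flank=10_000):
    """Return a 'contig:start-stop' string covering pos +/- flank.

    Coordinates are 1-based, so the start is clamped at 1. The end is
    not clamped to the contig length, which this sketch does not know.
    """
    start = max(1, pos - flank)
    return f"{contig}:{start}-{pos + flank}"
```

For the position where the run above stalled, `window_interval("chr4", 128685604)` gives `chr4:128675604-128695604`, which can be used as an interval to restrict processing to that region.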

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    As indicated in the error message, you need to increase the memory heap size. RR is a memory-intensive process.

  • dklevebring (Member)

    Well yeah.

    "With that we anticipate that ReduceReads will run 5x faster than it currently does and need 2g (or 4g) of memory."

    I was under the impression that this would be fixed in GATK 2.2+. I can't really understand the need for RR to be memory intensive under normal circumstances. As it is, the memory keeps building during the run up to the point where it crashes (at -Xmx). Why not read a chunk, handle it, write the processed reads to the outfile, and repeat? Is there really a need to keep all that data in memory during the run?

    Thanks
    Daniel
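
The read-a-chunk, handle-it, write-it-out pattern suggested above can be sketched generically in Python as follows. This is not how ReduceReads is implemented. The function and parameter names are hypothetical, and a real read compressor would probably also need to carry some state across chunk boundaries, which this sketch ignores.

```python
from itertools import islice


def process_in_chunks(records, handle_chunk, write, chunk_size=10_000):
    """Stream records through handle_chunk one fixed-size chunk at a time.

    Only one chunk is held in memory at once. handle_chunk takes a list of
    records and returns an iterable of outputs; write consumes each output.
    Returns the number of input records processed.
    """
    it = iter(records)
    total = 0
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        for out in handle_chunk(chunk):
            write(out)
        total += len(chunk)
    return total
```

Peak memory here is governed by `chunk_size` and by whatever `handle_chunk` keeps internally, not by the size of the input.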

  • ebanks, Broad Institute (Member, Broadie, Dev) ✭✭✭✭

    There is a known memory issue with RR connected to compression of the read names that we are trying to address for GATK 2.5. You might want to try turning off name compression (--dont_compress_read_names) and see if that helps.

  • dklevebring (Member)

    @ebanks Thanks, I'll try that. Sorry if I sounded blunt; I completely understand that things get pushed to a later version. TYVM for a great suite.

    //Daniel
