SVPreprocess Error: java.lang.OutOfMemoryError

Dear Genome STRiP users,

I can successfully run SVPreprocess on 3418 samples. However, when I increase the sample size to 10686, it fails with the error below, and it seems that even after this error occurs, header.bam is not created.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.StringBuilder.toString(StringBuilder.java:407)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:126)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:655)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:376)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:202)
    at org.broadinstitute.sv.dataset.SAMFileLocation.createSamFileReader(SAMFileLocation.java:97)
    at org.broadinstitute.sv.dataset.SAMLocation.createSamFileReader(SAMLocation.java:41)
    at org.broadinstitute.sv.util.sam.SAMUtils.getMergedSAMFileHeader(SAMUtils.java:86)
    at org.broadinstitute.sv.apps.ExtractBAMSubset.run(ExtractBAMSubset.java:104)
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
    at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
    at org.broadinstitute.sv.apps.ExtractBAMSubset.main(ExtractBAMSubset.java:74) 
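
For context, the OOM is thrown while reading and merging the per-sample BAM headers in SAMUtils.getMergedSAMFileHeader (see the trace above). A rough htsjdk sketch of that kind of merge (my own illustration, not the Genome STRiP source) shows why heap usage grows with the number of input BAMs:

    import htsjdk.samtools.SAMFileHeader;
    import htsjdk.samtools.SamFileHeaderMerger;
    import htsjdk.samtools.SamReader;
    import htsjdk.samtools.SamReaderFactory;

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only: merge the headers of many BAM files, roughly
    // what a merged-header step has to do. Every header is decoded into Strings
    // and SAMFileHeader objects and kept in memory, so the heap requirement
    // scales with the number of samples and the size of their headers.
    public class MergeHeadersSketch {
        public static void main(String[] args) throws Exception {
            List<SAMFileHeader> headers = new ArrayList<>();
            for (String path : args) {  // e.g. thousands of BAM paths
                try (SamReader reader = SamReaderFactory.makeDefault().open(new File(path))) {
                    headers.add(reader.getFileHeader());  // each header stays resident
                }
            }
            // Merge read groups, program records, and sequence dictionaries.
            SamFileHeaderMerger merger =
                    new SamFileHeaderMerger(SAMFileHeader.SortOrder.coordinate, headers, true);
            System.out.println("Merged read groups: "
                    + merger.getMergedHeader().getReadGroups().size());
        }
    }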

This error looks different from the other java.lang.OutOfMemoryError I encountered last year: last time the message was "java.lang.OutOfMemoryError: GC overhead limit exceeded", but this time it is "java.lang.OutOfMemoryError: Java heap space".

Is there any way to modify SVQScript.q or other scripts to solve this problem? I would prefer to run in a single batch because there is no good way to merge the CNV regions detected by SVCNVDiscovery across multiple batches. Thank you very much.

Best regards,
Wusheng

Comments

  • I think I solved this problem by editing lines 1284 and 1285 of SVQScript.q to

            this.memoryLimit = Some(85)
            this.javaMemoryLimit = Some(85)
    

    And indeed, after header.bam was created, I found that about 83 GB of memory had been used, which convinces me that the cause of the problem above is the memory limit in SVQScript.q.
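
    If I understand the Queue plumbing correctly, a setting like javaMemoryLimit = Some(85) ends up as the -Xmx flag on the spawned java command, while memoryLimit drives the scheduler's memory request. A hedged sketch of that mapping (the helper and values below are my own, not Queue code):

        import java.util.Arrays;
        import java.util.List;

        // Hypothetical illustration: turn a per-job Java memory limit (in GB) into a -Xmx flag.
        public class MemoryLimitSketch {
            static List<String> buildCommand(double javaMemoryLimitGb, String jar, String toolClass) {
                String xmx = String.format("-Xmx%dm", (long) (javaMemoryLimitGb * 1024));
                return Arrays.asList("java", xmx, "-cp", jar, toolClass);
            }

            public static void main(String[] args) {
                // Prints: [java, -Xmx87040m, -cp, SVToolkit.jar, org.broadinstitute.sv.apps.ExtractBAMSubset]
                System.out.println(buildCommand(85, "SVToolkit.jar",
                        "org.broadinstitute.sv.apps.ExtractBAMSubset"));
            }
        }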

    So now my question becomes: if I want to analyze more samples in a single batch, say 100000 samples, and the HPC I am working on can provide enough memory, can I always edit the two lines above to let my job run? Or are there other hard limits in Genome STRiP on the sample size (# of samples)? Thank you very much.

    Best regards,
    Wusheng

  • bhandsaker (Member, Broadie, Moderator, admin)

    To process large data sets (tens of thousands of samples), we have been working on some new pipelines which will eventually become Genome STRiP v3.

    These pipelines are designed for cloud computing. We also became convinced that it would not be practical or desirable to do true simultaneous calling on such large cohorts. So instead the new pipelines do discovery in batches, consolidate the discovered sites, and then regenotype all discovered sites, again in batches.

    We are currently using around 100 samples per batch, mostly to keep cost down (some of the algorithms are super-linear in sample size). For really large sample sizes, you might want to consider a similar strategy.
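
    As a rough sketch of that batch-then-consolidate-then-regenotype flow (the helper names and batch size below are placeholders, not the actual v3 pipeline code):

        import java.util.ArrayList;
        import java.util.List;

        // Placeholder sketch of the batched strategy described above: discover per batch,
        // consolidate the discovered sites, then regenotype every consolidated site per batch.
        public class BatchedDiscoverySketch {
            static <T> List<List<T>> partition(List<T> items, int batchSize) {
                List<List<T>> batches = new ArrayList<>();
                for (int i = 0; i < items.size(); i += batchSize) {
                    batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
                }
                return batches;
            }

            public static void main(String[] args) {
                List<String> samples = List.of(/* sample IDs */);
                List<String> allSites = new ArrayList<>();
                for (List<String> batch : partition(samples, 100)) {
                    allSites.addAll(discoverSites(batch));   // per-batch discovery (placeholder)
                }
                List<String> sites = consolidate(allSites);  // merge/deduplicate sites (placeholder)
                for (List<String> batch : partition(samples, 100)) {
                    regenotype(batch, sites);                // regenotype all sites per batch (placeholder)
                }
            }

            // Stand-ins for the actual discovery/genotyping tools.
            static List<String> discoverSites(List<String> batch) { return new ArrayList<>(); }
            static List<String> consolidate(List<String> sites) { return sites; }
            static void regenotype(List<String> batch, List<String> sites) { }
        }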
