Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

SVDiscovery Error: GC overhead limit exceeded

Dear Genome STRiP users,

I am running SVDiscovery with regards to 3418 sample (SVPreprocess completed without any errors) with the following script:

module purge
module load r/3.5.0
module load samtools/1.8
module load tabix/0.2.6
module load drmaa/1.0.7-PSNC

classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
svpreprocess_dir="/proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_success"
rundir="/proj/yunligrp/users/minzhi/gs_test_svdiscovery"

java -Xmx4g -cp ${classpath} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVDiscovery.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -cp ${classpath} \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
    -R /proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
    -I /proj/yunligrp/users/minzhi/gs_script/NWD.recab.list \
    -genderMapFile /proj/yunligrp/users/minzhi/gs_script/JHS_full_all_male_gender.map \
    -md ${svpreprocess_dir}/md_tempdir \
    -runDirectory ${rundir} \
    -jobLogDir ${rundir}/logs \
    -O JHS_full_list_00.dels.vcf \
    -minimumSize 100 \
    -maximumSize 100000 \
    -L chr16:1-500000 \
    -jobRunner Drmaa \
    -gatkJobRunner Drmaa \
    -jobNative "--mem-per-cpu=5000 --time=02:00:00 --nodes=1 --ntasks-per-node=4" \
    -jobQueue general \
    -run \
    || exit 1

And I have edited the class CreateMergedBamHeaders(bamFile: File) of the file SVQScript.q to completed the SVPreprocess as below:

    class CreateMergedBamHeaders(bamFile: File) 
        extends JavaCommand {
        this.javaMainClass = "org.broadinstitute.sv.apps.ExtractBAMSubset"
        this.inputFile = bamInputs
        this.outputFile = bamFile
        this.memoryLimit = Some(27)
        /*this.javaMemoryLimit = Some(3)*/
        this.javaMemoryLimit = Some(27)
        val readGroupMappingFiles =
            metaDataLocationList
                .map(md => new File(md, "read_groups.dat"))
                .filter(file => file.exists)
        commandArguments +=
            optional(" -R ", referenceFile) +
            optional(" -IC ", inputFileIndexCache) +
            repeat(" -sample ", sampleList) +
            required(" -L ", "NONE") +
            repeat(" -readGroupMappingFile ", readGroupMappingFiles)
    }

However, I met the "GC overhead limit exceeded" error shown as below.

ERROR 23:30:42,257 FunctionEdge - Error:  'java'  '-Xmx3072m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs_test_svdiscovery/.queue/tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVDiscovery '-T' 'SVDiscoveryWalker'  '-R' '/proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-I' '/proj/yunligrp/users/minzhi/gs_script/NWD.recab.list'  '-O' '/proj/yunligrp/users/minzhi/gs_test_svdiscovery/JHS_full_list_00.dels.unfiltered.vcf'  '-disableGATKTraversal' 'true'  '-md' '/proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_success/md_tempdir'  '-configFile' '/proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt' '-configFile' '/proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt'  '-runDirectory' '/proj/yunligrp/users/minzhi/gs_test_svdiscovery'  '-genderMapFile' '/proj/yunligrp/users/minzhi/gs_script/JHS_full_all_male_gender.map'  '-genomeMaskFile' '/proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta'  '-L' 'chr16:1-500000'  '-runFilePrefix' 'JHS_full_list_00.dels'  '-searchLocus' 'chr16:1-500000'  '-searchWindow' 'chr16:1-500000'  '-searchMinimumSize' '100'  '-searchMaximumSize' '100000'  '-storeReadPairFile' 'true'  
ERROR 23:30:42,278 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs_test_svdiscovery/logs/SVDiscovery-1.out:
INFO  23:27:20,394 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  23:27:20,397 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7.GS-r1748-0-g74bfe0b, Compiled 2018/04/10 10:30:23 
INFO  23:27:20,397 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  23:27:20,397 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  23:27:20,398 HelpFormatter - [Sat Jul 21 23:27:20 EDT 2018] Executing on Linux 3.10.0-862.3.2.el7.x86_64 amd64 
INFO  23:27:20,398 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_171-b10 
INFO  23:27:20,403 HelpFormatter - Program Args: -T SVDiscoveryWalker -R /proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta -O /proj/yunligrp/users/minzhi/gs_test_svdiscovery/JHS_full_list_00.dels.unfiltered.vcf -disableGATKTraversal true -md /proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_success/md_tempdir -configFile /proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt -configFile /proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt -runDirectory /proj/yunligrp/users/minzhi/gs_test_svdiscovery -genderMapFile /proj/yunligrp/users/minzhi/gs_script/JHS_full_all_male_gender.map -genomeMaskFile /proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta -L chr16:1-500000 -runFilePrefix JHS_full_list_00.dels -searchLocus chr16:1-500000 -searchWindow chr16:1-500000 -searchMinimumSize 100 -searchMaximumSize 100000 -storeReadPairFile true 
INFO  23:27:20,406 HelpFormatter - Executing as [email protected] on Linux 3.10.0-862.3.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_171-b10. 
INFO  23:27:20,407 HelpFormatter - Date/Time: 2018/07/21 23:27:20 
INFO  23:27:20,407 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  23:27:20,407 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  23:27:20,413 21-Jul-2018 GenomeAnalysisEngine - Strictness is SILENT
INFO  23:27:21,720 21-Jul-2018 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  23:27:22,581 21-Jul-2018 IntervalUtils - Processing 500000 bp from intervals
INFO  23:27:22,633 21-Jul-2018 GenomeAnalysisEngine - Preparing for traversal
INFO  23:27:22,634 21-Jul-2018 GenomeAnalysisEngine - Done preparing for traversal
INFO  23:27:22,634 21-Jul-2018 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  23:27:22,634 21-Jul-2018 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  23:27:22,634 21-Jul-2018 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime
INFO  23:27:22,634 21-Jul-2018 SVDiscovery - Initializing SVDiscovery ...
INFO  23:27:22,634 21-Jul-2018 SVDiscovery - Reading configuration file ...
INFO  23:27:22,649 21-Jul-2018 SVDiscovery - Read configuration file.
INFO  23:27:22,649 21-Jul-2018 SVDiscovery - Opening reference sequence ...
INFO  23:27:22,650 21-Jul-2018 SVDiscovery - Opened reference sequence.
INFO  23:27:22,650 21-Jul-2018 SVDiscovery - Opening genome mask ...
INFO  23:27:22,653 21-Jul-2018 SVDiscovery - Opened genome mask.
INFO  23:27:22,653 21-Jul-2018 SVDiscovery - Initializing input data set ...
INFO  23:27:52,639 21-Jul-2018 ProgressMeter -        Starting         0.0    30.0 s      49.6 w      100.0%    30.0 s       0.0 s
INFO  23:28:22,645 21-Jul-2018 ProgressMeter -        Starting         0.0    60.0 s      99.2 w      100.0%    60.0 s       0.0 s
INFO  23:28:51,373 21-Jul-2018 SVDiscovery - Initialized data set: 3418 files, 26938 read groups, 3418 samples.
INFO  23:28:51,378 21-Jul-2018 MetaData - Opening metadata ... 
INFO  23:28:51,378 21-Jul-2018 MetaData - Adding metadata location /proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_success/md_tempdir ...
INFO  23:28:51,384 21-Jul-2018 MetaData - Opened metadata.
INFO  23:28:51,397 21-Jul-2018 SVDiscovery - Opened metadata.
INFO  23:28:51,406 21-Jul-2018 MetaData - Loading insert size distributions ...
INFO  23:28:52,647 21-Jul-2018 ProgressMeter -        Starting         0.0    90.0 s     148.8 w      100.0%    90.0 s       0.0 s
INFO  23:29:24,905 21-Jul-2018 ProgressMeter -        Starting         0.0     2.0 m     202.2 w      100.0%     2.0 m       0.0 s
INFO  23:29:57,312 21-Jul-2018 ProgressMeter -        Starting         0.0     2.6 m     255.8 w      100.0%     2.6 m       0.0 s
INFO  23:30:30,555 21-Jul-2018 ProgressMeter -        Starting         0.0     3.1 m     310.7 w      100.0%     3.1 m       0.0 s
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:139)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:94)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:655)
    at htsjdk.samtools.BAMFileReader.(BAMFileReader.java:298)
    at htsjdk.samtools.BAMFileReader.(BAMFileReader.java:176)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:376)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:202)
    at org.broadinstitute.sv.dataset.SAMFileLocation.createSamFileReader(SAMFileLocation.java:97)
    at org.broadinstitute.sv.dataset.SAMLocation.createSamFileReader(SAMLocation.java:41)
    at org.broadinstitute.sv.util.sam.SAMUtils.getMergedSAMFileHeader(SAMUtils.java:86)
    at org.broadinstitute.sv.dataset.DataSet.getMergedSAMFileHeader(DataSet.java:88)
    at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.initReadPairSelection(DeletionDiscoveryAlgorithm.java:232)
    at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.initialize(DeletionDiscoveryAlgorithm.java:112)
    at org.broadinstitute.sv.discovery.SVDiscoveryWalker.createDiscoveryAlgorithm(SVDiscoveryWalker.java:89)
    at org.broadinstitute.sv.discovery.SVDiscoveryWalker.initialize(SVDiscoveryWalker.java:79)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:141)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:91)
    at org.broadinstitute.sv.main.SVDiscovery.main(SVDiscovery.java:21) 
INFO  23:30:42,279 QGraph - Writing incremental jobs reports... 
INFO  23:30:42,279 QJobsReporter - Writing JobLogging GATKReport to file /proj/yunligrp/users/minzhi/gs_test_svdiscovery/SVDiscovery.jobreport.txt 
INFO  23:30:42,291 QGraph - 1 Pend, 0 Run, 1 Fail, 0 Done 

Does it mean that I need to edit qscript file? Thank you very much.

Best regards,
Wusheng

Answers

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    Yes. I think this was covered in a separate thread here:
    https://gatkforums.broadinstitute.org/gatk/discussion/comment/50744

  • zhangwushengzhangwusheng Member

    Hi Bob @bhandsaker ,

    That question is asked by me. And the problem reported in this thread happened after I edited the class CreateMergedBamHeaders(bamFile: File) in the SVQScript.q as in that thread. Besides, I edited the SVQScript.q again to increase the limit as below

            this.memoryLimit = Some(81)
            /*this.javaMemoryLimit = Some(3)*/
            this.javaMemoryLimit = Some(81)
    

    But the problem as reported above still happen. And I found that the GCoverhead problem reported here is a little different from the problem in that thread -- the following lines do not exist in the problem in that thread:

        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
    

    Does it mean that I should edit other .q script? Thank you in advance.

    Best regards,

  • bhandsakerbhandsaker Member, Broadie, Moderator admin

    If you are running out of memory at 81G, then you should try a different workaround. I believe htsjdk is very inefficient in how it manages bam file headers.

    What this step is doing is creating a small bam file that contains merged headers from all of the input bam files (so the sample and read group metadata can be easily accessed by downstream programs without opening 3400 separate bam files).

    One workaround is to do this manually by extracting the bam headers (as text), merging them manually (into a valid SAM file), and then converting sam -> bam using samtools. The main caveat is that you will need to make sure that there are no collisions between the read group IDs.

    To scale to large cohorts, we have generally moved to doing batch-based calling. In our cloud-based pipelines we do this in batches of about 100 samples and then merge across batches. A single batch of 3418 samples is doable but will be slow. For some of the downstream calling, the CNV pipeline in particular, runtime is superlinear in the number of samples.

    You can either run the preprocessing in batches (most efficient), or run the preprocessing in a single batch and then do the calling in smaller batches (e.g. by providing sample lists in the inputs).

  • zhangwushengzhangwusheng Member

    Hi [email protected]:

    Thank you very much for your detailed explanation and multiple solutions. And to the 120 samples, I indeed can run the SVDiscovery without any error. However, I still have two questions that need your help:

    1. Based on the documentation of Genome STRiP, the sample size for analysis is at least 30. Does it mean that the more sample we have, the more accurate our result will be? If so, will there be an obvious error if I do a batch-based calling (each batch size is 100 samples) rather than do a single 3418 samples calling?

    2. Weird memory usage in SVDiscovery (failed but not running out of memory at 81G):
      Even though I request 10GB when submitting the job, and just as mentioned above, edit the SVQScript.q as below:

            this.memoryLimit = Some(81)
            /*this.javaMemoryLimit = Some(3)*/
            this.javaMemoryLimit = Some(81)
    

    The usage of the memory of this job is only about 1GB (1075024K) but not out of memory at 81 GB. So I guess that my situation is not what you described, and may I have your suggestion in this situation? Does it mean that there is any other hidden bug? Thank you in advance.

    Best regards,
    Wusheng

Sign In or Register to comment.