SVPreprocess succeeded for a small number of samples, but failed for a large number of samples

Hi Genome STRiP users,

I am stuck on the SVPreprocess step. At first, I tried to run SVPreprocess on 30 samples. The script I used is below. (To submit the job on the cluster, I requested 4 GB of memory and 4 cores on one node.)

module purge
module load r/3.4.1
module load samtools/1.6
module load tabix/0.2.6
# Java 1.8.0_171 is pre-installed; no need to load a module for it.

classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"

java -Xmx4g -cp ${classpath} \
     org.broadinstitute.gatk.queue.QCommandLine \
     -S ${SV_DIR}/qscript/SVPreprocess.q \
     -S ${SV_DIR}/qscript/SVQScript.q \
     -cp ${classpath} \
     -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
     -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
     -R /proj/yunligrp/users/yesu/hs38DH/hs38DH.fa \
     -L chr16:1-500000 \
     -I /proj/yunligrp/users/minzhi/gs_script/JHS_30.list \
     -md md_tempdir \
     -ploidyMapFile /proj/yunligrp/users/minzhi/gs_script/standard_ploidy.map \
     -jobLogDir logs \
     -run \
     || exit 1

The "standard_ploidy.map" file is one I wrote myself; its entire content is the single line below.

* * * * 2
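(For reference, each line of a ploidy map has five whitespace-separated fields: sequence, start, end, sex, and ploidy, with '*' as a wildcard, so this single line declares every position diploid for both sexes. A map that also handles the sex chromosomes might look like the illustrative sketch below; it deliberately ignores the pseudoautosomal regions, and the reference metadata bundle mentioned later in this thread ships an authoritative map.)

```
chrX  *  *  F  2
chrX  *  *  M  1
chrY  *  *  F  0
chrY  *  *  M  1
*     *  *  *  2
```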

I guess that the SVPreprocess run was successful, because the tail of the console output was:

...
INFO  21:56:48,102 QJobsReporter - Writing JobLogging GATKReport to file /proj/yunligrp/users/minzhi/gs_test_svpreprocess/SVPreprocess.jobreport.txt 
INFO  21:56:48,126 QGraph - 0 Pend, 0 Run, 0 Fail, 114 Done
INFO  21:56:48,127 QCommandLine - Writing final jobs report... 
...
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------

But in the job report .txt file, the "intermediate" column is always false. Does that mean the run failed? Is there any way to confirm that SVPreprocess ran successfully?
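For context, the only check I currently rely on is a grep over the saved console output, treating the run as successful when the final QGraph line reports zero failed jobs (the log file name below is hypothetical; I capture the console output with the scheduler's output redirection):

```shell
# Minimal sketch of the success check I currently rely on (my own script,
# not part of Genome STRiP): the run is treated as successful when the
# final QGraph status line reports "0 Fail" and the closing banner appears.
queue_run_ok() {
    grep -q ", 0 Fail," "$1" && grep -q "Done. There were no warn messages." "$1"
}

# usage: queue_run_ok svpreprocess.console.log && echo "run looks clean"
```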

Next, I ran SVPreprocess on the whole set of 3418 samples with the same script. To submit the job on the cluster, I requested 40 GB of memory and 8 cores on one node. However, the tail of the console output shows two failures.

The first one is:

...
ERROR 22:08:33,202 FunctionEdge - Error:  'java'  '-Xmx3072m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs_test_svpreprocess/.queue/tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.apps.ExtractBAMSubset'  '-I' '/proj/yunligrp/users/minzhi/gs_script/NWD.recab.list'  '-O' '/proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/headers.bam'  '-R' '/proj/yunligrp/users/yesu/hs38DH/hs38DH.fa'  '-L' 'NONE'
ERROR 22:08:33,205 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs_test_svpreprocess/logs/SVPreprocess-3.out:
INFO  22:06:50,764 HelpFormatter - --------------------------------------------------------- 
INFO  22:06:50,766 HelpFormatter - Program Name: org.broadinstitute.sv.apps.ExtractBAMSubset 
INFO  22:06:50,769 HelpFormatter - Program Args: -I /proj/yunligrp/users/minzhi/gs_script/NWD.recab.list -O /proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/headers.bam -R /proj/yunligrp/users/yesu/hs38DH/hs38DH.fa -L NONE 
INFO  22:06:50,774 HelpFormatter - Executing as [email protected] on Linux 3.10.0-693.11.6.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_171-b10. 
INFO  22:06:50,774 HelpFormatter - Date/Time: 2018/06/26 22:06:50 
INFO  22:06:50,774 HelpFormatter - --------------------------------------------------------- 
INFO  22:06:50,775 HelpFormatter - --------------------------------------------------------- 
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.<init>(SAMTextHeaderCodec.java:268)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:95)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:655)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:376)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:202)
    at org.broadinstitute.sv.dataset.SAMFileLocation.createSamFileReader(SAMFileLocation.java:97)
    at org.broadinstitute.sv.dataset.SAMLocation.createSamFileReader(SAMLocation.java:41)
    at org.broadinstitute.sv.util.sam.SAMUtils.getMergedSAMFileHeader(SAMUtils.java:86)
    at org.broadinstitute.sv.apps.ExtractBAMSubset.run(ExtractBAMSubset.java:104)
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
    at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
    at org.broadinstitute.sv.apps.ExtractBAMSubset.main(ExtractBAMSubset.java:74) 
INFO  22:08:33,205 QGraph - Writing incremental jobs reports... 
...

The error message says the content is in SVPreprocess-3.out, and the content of that .out file is the same as the above.

There is also a weird detail: '-L NONE'. In fact, except at the very start of the console output, every '-L' argument is followed by 'NONE' (while every '-genomeInterval' is correctly followed by 'chr16:1-500000'). Is this related to the error above?

The second one is:

...
ERROR 05:16:55,166 FunctionEdge - Error:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs_test_svpreprocess/.queue/tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVCommandLine '-T' 'ComputeInsertSizeHistogramsWalker'  '-R' '/proj/yunligrp/users/yesu/hs38DH/hs38DH.fa'  '-I' '/proj/yunligrp/users/yesu/NWD.recab.bam/NWD389372.recab.bam'  '-O' '/proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/isd/NWD389372.recab.hist.bin'  '-disableGATKTraversal' 'true'  '-md' 'md_tempdir'  '-genomeInterval' 'chr16:1-500000'  '-configFile' '/proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt'  '-chimerismFile' 'md_tempdir/isd/NWD389372.recab.chimer.dat'  '-createHistogramFile' 'true'  -createEmpty  
ERROR 05:16:55,167 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs_test_svpreprocess/logs/SVPreprocess-2992.out:
INFO  04:23:26,924 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  04:23:26,926 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7.GS-r1748-0-g74bfe0b, Compiled 2018/04/10 10:30:23 
...

But the content of SVPreprocess-2992.out shows nothing wrong (attached below).

INFO  04:23:26,924 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  04:23:26,926 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7.GS-r1748-0-g74bfe0b, Compiled 2018/04/10 10:30:23 
INFO  04:23:26,926 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  04:23:26,926 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  04:23:26,927 HelpFormatter - [Wed Jun 27 04:23:26 EDT 2018] Executing on Linux 3.10.0-693.11.6.el7.x86_64 amd64 
INFO  04:23:26,927 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_171-b10 
INFO  04:23:26,930 HelpFormatter - Program Args: -T ComputeInsertSizeHistogramsWalker -R /proj/yunligrp/users/yesu/hs38DH/hs38DH.fa -O /proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/isd/NWD389372.recab.hist.bin -disableGATKTraversal true -md md_tempdir -genomeInterval chr16:1-500000 -configFile /proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt -chimerismFile md_tempdir/isd/NWD389372.recab.chimer.dat -createHistogramFile true -createEmpty 
INFO  04:23:26,935 HelpFormatter - Executing as minz[email protected] on Linux 3.10.0-693.11.6.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_171-b10. 
INFO  04:23:26,935 HelpFormatter - Date/Time: 2018/06/27 04:23:26 
INFO  04:23:26,935 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  04:23:26,935 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  04:23:26,940 27-Jun-2018 GenomeAnalysisEngine - Strictness is SILENT
INFO  04:23:27,959 27-Jun-2018 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  04:23:28,766 27-Jun-2018 GenomeAnalysisEngine - Preparing for traversal
INFO  04:23:28,781 27-Jun-2018 GenomeAnalysisEngine - Done preparing for traversal
INFO  04:23:28,782 27-Jun-2018 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  04:23:28,782 27-Jun-2018 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  04:23:28,782 27-Jun-2018 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime
INFO  04:23:28,787 27-Jun-2018 MetaData - Opening metadata ... 
INFO  04:23:28,787 27-Jun-2018 MetaData - Adding metadata location md_tempdir ...
INFO  04:23:28,789 27-Jun-2018 MetaData - Opened metadata.
INFO  04:23:29,521 27-Jun-2018 ProgressMeter -            done         0.0     0.0 s       8.6 d      100.0%     0.0 s       0.0 s
INFO  04:23:29,522 27-Jun-2018 ProgressMeter - Total runtime 0.74 secs, 0.01 min, 0.00 hours
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------

At first, I guessed that increasing the requested memory would fix the first problem, so I requested 220 GB of memory and re-ran SVPreprocess in a new, empty folder. However, the errors stayed the same; even the indices of the failing log files were unchanged (SVPreprocess-3.out and SVPreprocess-2992.out).
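Since Queue numbers its jobs deterministically, it makes sense that the same log indices fail on a clean re-run. To list every out-of-memory job at once instead of scrolling the console, I use a small helper like the sketch below (the function name is my own; the logs directory is the -jobLogDir from the script above):

```shell
# Hypothetical helper (my own, not part of Genome STRiP): print the names
# of Queue job logs whose job died with an OutOfMemoryError, so every
# failing job can be found in one pass over the -jobLogDir directory.
find_oom_logs() {
    grep -l "OutOfMemoryError" "$1"/*.out 2>/dev/null
}

# usage: find_oom_logs logs
```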

May I have your suggestions about these errors and questions? Thank you very much.

Best regards,
Wusheng

Comments

  • bhandsaker (Member, Broadie, Moderator, admin)

    I suspect the root of the problem is that you are not using a supplied reference metadata bundle.
    If you are using hg38, you should not just supply your own fasta file.
    Instead, you should download the hg38 reference bundle from here:
    http://software.broadinstitute.org/software/genomestrip/node_ReferenceMetadata.html
    Unpack the tar file in your local environment and set -R to point to the reference fasta file within this directory. Other arguments are defaulted off of the location of the -R .fasta file.

  • zhangwusheng (Member)

    Hi Bob,

    Thank you very much for your help. I downloaded Homo_sapiens_assembly38_12Oct2016.tar.gz and untarred it on my cluster. Then I pointed -R to "Homo_sapiens_assembly38.fasta" without changing anything else, so my script became:

    module purge
    module load r/3.5.0
    module load samtools/1.8
    module load tabix/0.2.6
    
    classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
    
    java -Xmx4g -cp ${classpath} \
         org.broadinstitute.gatk.queue.QCommandLine \
         -S ${SV_DIR}/qscript/SVPreprocess.q \
         -S ${SV_DIR}/qscript/SVQScript.q \
         -cp ${classpath} \
         -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
         -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
         -R /proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
         -L chr16:1-500000 \
         -I /proj/yunligrp/users/minzhi/gs_script/NWD.recab.list \
         -md md_tempdir \
         -tempDir /proj/yunligrp/users/minzhi/gs_tempdir \
         -ploidyMapFile /proj/yunligrp/users/minzhi/gs_script/standard_ploidy.map \
         -jobLogDir logs \
         -run \
         || exit 1
    

    But the GC overhead problem remains the same:

    ERROR 13:29:35,515 FunctionEdge - Error:  'java'  '-Xmx3072m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs_tempdir'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.apps.ExtractBAMSubset'  '-I' '/proj/yunligrp/users/minzhi/gs_script/NWD.recab.list'  '-O' '/proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/headers.bam'  '-R' '/proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-L' 'NONE'  
    ERROR 13:29:35,518 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs_test_svpreprocess/logs/SVPreprocess-3.out:
    INFO  13:27:34,270 HelpFormatter - --------------------------------------------------------- 
    INFO  13:27:34,272 HelpFormatter - Program Name: org.broadinstitute.sv.apps.ExtractBAMSubset 
    INFO  13:27:34,276 HelpFormatter - Program Args: -I /proj/yunligrp/users/minzhi/gs_script/NWD.recab.list -O /proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/headers.bam -R /proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta -L NONE 
    INFO  13:27:34,281 HelpFormatter - Executing as [email protected] on Linux 3.10.0-693.11.6.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_171-b10. 
    INFO  13:27:34,281 HelpFormatter - Date/Time: 2018/06/29 13:27:34 
    INFO  13:27:34,281 HelpFormatter - --------------------------------------------------------- 
    INFO  13:27:34,282 HelpFormatter - --------------------------------------------------------- 
    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:139)
        at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:94)
        at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:655)
        at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
        at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:376)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:202)
        at org.broadinstitute.sv.dataset.SAMFileLocation.createSamFileReader(SAMFileLocation.java:97)
        at org.broadinstitute.sv.dataset.SAMLocation.createSamFileReader(SAMLocation.java:41)
        at org.broadinstitute.sv.util.sam.SAMUtils.getMergedSAMFileHeader(SAMUtils.java:86)
        at org.broadinstitute.sv.apps.ExtractBAMSubset.run(ExtractBAMSubset.java:104)
        at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
        at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
        at org.broadinstitute.sv.apps.ExtractBAMSubset.main(ExtractBAMSubset.java:74) 
    

    I requested 4 cores and 10 GB of memory, and after I canceled the job I found that it had only used about 5 GB. So is there anything wrong with my -tempDir flag that could lead to this problem? Or are the versions of the software I am using incompatible? May I have your suggestions about this problem?

    Thank you in advance.

    Best regards,
    Wusheng

  • bhandsaker (Member, Broadie, Moderator, admin)

    The immediate problem is the -Xmx3g argument to ExtractBAMSubset. This job is trying to extract the headers from the input bam files (to allow efficient access later), and I think the htsjdk code for bam headers is not very memory efficient. I would suggest editing the SVQScript.q file to give more memory to this job, called CreateMergedBamHeaders. Specifically, increase this line:
    this.javaMemoryLimit = Some(3)
    And you might want to turn up memoryLimit also to match what you have requested on the node.

  • zhangwusheng (Member)

    Hi Bob,

    Thank you very much for your help! This really works. I guess that when you say "you might want to turn up memoryLimit also", you mean the line

    this.javaMemoryLimit = Some(3)
    

    in the class CreateMergedBamHeaders(bamFile: File) in the file SVQScript.q. Is this correct? Or is there a memoryLimit flag for the run script of SVCNVDiscovery?

    The edited class CreateMergedBamHeaders(bamFile: File) is shown below.

        class CreateMergedBamHeaders(bamFile: File) 
            extends JavaCommand {
            this.javaMainClass = "org.broadinstitute.sv.apps.ExtractBAMSubset"
            this.inputFile = bamInputs
            this.outputFile = bamFile
            this.memoryLimit = Some(27)
            /*this.javaMemoryLimit = Some(3)*/
            this.javaMemoryLimit = Some(27)
            val readGroupMappingFiles =
                metaDataLocationList
                    .map(md => new File(md, "read_groups.dat"))
                    .filter(file => file.exists)
            commandArguments +=
                optional(" -R ", referenceFile) +
                optional(" -IC ", inputFileIndexCache) +
                repeat(" -sample ", sampleList) +
                required(" -L ", "NONE") +
                repeat(" -readGroupMappingFile ", readGroupMappingFiles)
        }
    
  • bhandsaker (Member, Broadie, Moderator, admin)