SVPreprocess succeeded for a small number of samples, but failed for a large number of samples

Hi Genome STRiP users,

I am stuck at the SVPreprocess step. At first, I ran SVPreprocess on 30 samples. The script I used is below. (To submit the job on the cluster, I requested 4 GB of memory and 4 cores on one node.)

module purge
module load r/3.4.1
module load samtools/1.6
module load tabix/0.2.6
# Java 1.8.0_171 is pre-installed; no module load is needed.

classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"

java -Xmx4g -cp ${classpath} \
     org.broadinstitute.gatk.queue.QCommandLine \
     -S ${SV_DIR}/qscript/SVPreprocess.q \
     -S ${SV_DIR}/qscript/SVQScript.q \
     -cp ${classpath} \
     -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
     -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
     -R /proj/yunligrp/users/yesu/hs38DH/hs38DH.fa \
     -L chr16:1-500000 \
     -I /proj/yunligrp/users/minzhi/gs_script/JHS_30.list \
     -md md_tempdir \
     -ploidyMapFile /proj/yunligrp/users/minzhi/gs_script/standard_ploidy.map \
     -jobLogDir logs \
     -run \
     || exit 1
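
For context, a cluster submission matching the resources described above might look like the following on SLURM (the scheduler and the wrapper script name are assumptions, not from the original run):

# Hypothetical SLURM submission; adjust for your scheduler.
# run_svpreprocess.sh is assumed to wrap the module loads and java command above.
sbatch --mem=4g --nodes=1 --cpus-per-task=4 run_svpreprocess.sh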

The "standard_ploidy.map" file, which I wrote myself, is shown below.

* * * * 2
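
A fuller ploidy map would normally also cover the sex chromosomes. A minimal sketch of writing one (column order sequence, start, end, sex, ploidy, with * as a wildcard, per the Genome STRiP documentation; pseudoautosomal regions are omitted for brevity):

# Hedged sketch: a ploidy map that also encodes sex-chromosome ploidy.
cat > ploidy_with_sex_chroms.map <<'EOF'
chrX * * M 1
chrY * * M 1
chrY * * F 0
* * * * 2
EOF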

I assumed that the SVPreprocess run succeeded, because the tail of the on-screen output was:

...
INFO  21:56:48,102 QJobsReporter - Writing JobLogging GATKReport to file /proj/yunligrp/users/minzhi/gs_test_svpreprocess/SVPreprocess.jobreport.txt 
INFO  21:56:48,126 QGraph - 0 Pend, 0 Run, 0 Fail, 114 Done
INFO  21:56:48,127 QCommandLine - Writing final jobs report... 
...
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------

But in the jobreport.txt, the "intermediate" column is always false. Does that mean the run failed? Is there any way to confirm that SVPreprocess ran successfully?
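
A minimal way to sanity-check a run, assuming the on-screen Queue output was captured to a file (the log file name queue_run.log is illustrative):

# The final QGraph line should report "0 Fail" and all jobs Done:
grep "QGraph -" queue_run.log | tail -1
# Per-job failures, if any, are echoed as FunctionEdge errors:
grep -c "FunctionEdge - Error" queue_run.log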

Next, I ran SVPreprocess on the full set of 3418 samples with the same script. To submit the job on the cluster, I requested 40 GB of memory and 8 cores on one node. However, the tail of the on-screen output shows two failures.

The first one is:

...
ERROR 22:08:33,202 FunctionEdge - Error:  'java'  '-Xmx3072m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs_test_svpreprocess/.queue/tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.apps.ExtractBAMSubset'  '-I' '/proj/yunligrp/users/minzhi/gs_script/NWD.recab.list'  '-O' '/proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/headers.bam'  '-R' '/proj/yunligrp/users/yesu/hs38DH/hs38DH.fa'  '-L' 'NONE'
ERROR 22:08:33,205 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs_test_svpreprocess/logs/SVPreprocess-3.out:
INFO  22:06:50,764 HelpFormatter - --------------------------------------------------------- 
INFO  22:06:50,766 HelpFormatter - Program Name: org.broadinstitute.sv.apps.ExtractBAMSubset 
INFO  22:06:50,769 HelpFormatter - Program Args: -I /proj/yunligrp/users/minzhi/gs_script/NWD.recab.list -O /proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/headers.bam -R /proj/yunligrp/users/yesu/hs38DH/hs38DH.fa -L NONE 
INFO  22:06:50,774 HelpFormatter - Executing as [email protected] on Linux 3.10.0-693.11.6.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_171-b10. 
INFO  22:06:50,774 HelpFormatter - Date/Time: 2018/06/26 22:06:50 
INFO  22:06:50,774 HelpFormatter - --------------------------------------------------------- 
INFO  22:06:50,775 HelpFormatter - --------------------------------------------------------- 
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.<init>(SAMTextHeaderCodec.java:268)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:95)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:655)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:376)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:202)
    at org.broadinstitute.sv.dataset.SAMFileLocation.createSamFileReader(SAMFileLocation.java:97)
    at org.broadinstitute.sv.dataset.SAMLocation.createSamFileReader(SAMLocation.java:41)
    at org.broadinstitute.sv.util.sam.SAMUtils.getMergedSAMFileHeader(SAMUtils.java:86)
    at org.broadinstitute.sv.apps.ExtractBAMSubset.run(ExtractBAMSubset.java:104)
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
    at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
    at org.broadinstitute.sv.apps.ExtractBAMSubset.main(ExtractBAMSubset.java:74) 
INFO  22:08:33,205 QGraph - Writing incremental jobs reports... 
...

The error message says the content is in SVPreprocess-3.out, and the content of that .out file is the same as above.

Besides, there is a weird part in the output above: -L NONE. In fact, except at the head of the printed output, every "-L" is followed by "NONE" (while every "-genomeInterval" is correctly followed by "chr16:1-500000"). Is this related to the error above?

The second one is:

...
ERROR 05:16:55,166 FunctionEdge - Error:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs_test_svpreprocess/.queue/tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVCommandLine '-T' 'ComputeInsertSizeHistogramsWalker'  '-R' '/proj/yunligrp/users/yesu/hs38DH/hs38DH.fa'  '-I' '/proj/yunligrp/users/yesu/NWD.recab.bam/NWD389372.recab.bam'  '-O' '/proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/isd/NWD389372.recab.hist.bin'  '-disableGATKTraversal' 'true'  '-md' 'md_tempdir'  '-genomeInterval' 'chr16:1-500000'  '-configFile' '/proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt'  '-chimerismFile' 'md_tempdir/isd/NWD389372.recab.chimer.dat'  '-createHistogramFile' 'true'  -createEmpty  
ERROR 05:16:55,167 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs_test_svpreprocess/logs/SVPreprocess-2992.out:
INFO  04:23:26,924 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  04:23:26,926 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7.GS-r1748-0-g74bfe0b, Compiled 2018/04/10 10:30:23 
...

But the content of SVPreprocess-2992.out shows nothing wrong (attached below).

INFO  04:23:26,924 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  04:23:26,926 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7.GS-r1748-0-g74bfe0b, Compiled 2018/04/10 10:30:23 
INFO  04:23:26,926 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  04:23:26,926 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  04:23:26,927 HelpFormatter - [Wed Jun 27 04:23:26 EDT 2018] Executing on Linux 3.10.0-693.11.6.el7.x86_64 amd64 
INFO  04:23:26,927 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_171-b10 
INFO  04:23:26,930 HelpFormatter - Program Args: -T ComputeInsertSizeHistogramsWalker -R /proj/yunligrp/users/yesu/hs38DH/hs38DH.fa -O /proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/isd/NWD389372.recab.hist.bin -disableGATKTraversal true -md md_tempdir -genomeInterval chr16:1-500000 -configFile /proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt -chimerismFile md_tempdir/isd/NWD389372.recab.chimer.dat -createHistogramFile true -createEmpty 
INFO  04:23:26,935 HelpFormatter - Executing as [email protected] on Linux 3.10.0-693.11.6.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_171-b10. 
INFO  04:23:26,935 HelpFormatter - Date/Time: 2018/06/27 04:23:26 
INFO  04:23:26,935 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  04:23:26,935 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  04:23:26,940 27-Jun-2018 GenomeAnalysisEngine - Strictness is SILENT
INFO  04:23:27,959 27-Jun-2018 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  04:23:28,766 27-Jun-2018 GenomeAnalysisEngine - Preparing for traversal
INFO  04:23:28,781 27-Jun-2018 GenomeAnalysisEngine - Done preparing for traversal
INFO  04:23:28,782 27-Jun-2018 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  04:23:28,782 27-Jun-2018 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  04:23:28,782 27-Jun-2018 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime
INFO  04:23:28,787 27-Jun-2018 MetaData - Opening metadata ... 
INFO  04:23:28,787 27-Jun-2018 MetaData - Adding metadata location md_tempdir ...
INFO  04:23:28,789 27-Jun-2018 MetaData - Opened metadata.
INFO  04:23:29,521 27-Jun-2018 ProgressMeter -            done         0.0     0.0 s       8.6 d      100.0%     0.0 s       0.0 s
INFO  04:23:29,522 27-Jun-2018 ProgressMeter - Total runtime 0.74 secs, 0.01 min, 0.00 hours
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------

At first, I guessed that increasing the requested memory would fix the first problem, so I requested 220 GB of memory and re-ran SVPreprocess in a new, empty folder. However, the errors remained exactly the same; even the indices of the failing log files were unchanged -- SVPreprocess-3.out and SVPreprocess-2992.out.

May I have your suggestions about these errors and questions? Thank you very much.

Best regards,
Wusheng

Comments

  • bhandsaker Member, Broadie, Moderator admin

    I suspect the root of the problem is that you are not using a supplied reference metadata bundle.
    If you are using hg38, you should not just supply your own fasta file.
    Instead, you should download the hg38 reference bundle from here:
    http://software.broadinstitute.org/software/genomestrip/node_ReferenceMetadata.html
    Unpack the tar file in your local environment and point -R at the reference fasta file inside this directory. The other reference-metadata arguments are derived by default from the location of the -R .fasta file.
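
    A minimal sketch of those steps (the exact download URL is listed on the page above; paths are illustrative):

    # Unpack the hg38 reference metadata bundle downloaded from the page above,
    # then point -R at the fasta inside the unpacked directory.
    tar -xzf Homo_sapiens_assembly38_12Oct2016.tar.gz
    # e.g. -R /path/to/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta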

  • zhangwusheng Member

    Hi Bob,

    Thank you very much for your help. I downloaded Homo_sapiens_assembly38_12Oct2016.tar.gz and untarred it on my cluster. Then I pointed -R to "Homo_sapiens_assembly38.fasta" without changing anything else. My script became:

    module purge
    module load r/3.5.0
    module load samtools/1.8
    module load tabix/0.2.6
    
    classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
    
    java -Xmx4g -cp ${classpath} \
         org.broadinstitute.gatk.queue.QCommandLine \
         -S ${SV_DIR}/qscript/SVPreprocess.q \
         -S ${SV_DIR}/qscript/SVQScript.q \
         -cp ${classpath} \
         -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
         -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
         -R /proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
         -L chr16:1-500000 \
         -I /proj/yunligrp/users/minzhi/gs_script/NWD.recab.list \
         -md md_tempdir \
         -tempDir /proj/yunligrp/users/minzhi/gs_tempdir \
         -ploidyMapFile /proj/yunligrp/users/minzhi/gs_script/standard_ploidy.map \
         -jobLogDir logs \
         -run \
         || exit 1
    

    But the GC overhead problem remains the same:

    ERROR 13:29:35,515 FunctionEdge - Error:  'java'  '-Xmx3072m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs_tempdir'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.apps.ExtractBAMSubset'  '-I' '/proj/yunligrp/users/minzhi/gs_script/NWD.recab.list'  '-O' '/proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/headers.bam'  '-R' '/proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-L' 'NONE'  
    ERROR 13:29:35,518 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs_test_svpreprocess/logs/SVPreprocess-3.out:
    INFO  13:27:34,270 HelpFormatter - --------------------------------------------------------- 
    INFO  13:27:34,272 HelpFormatter - Program Name: org.broadinstitute.sv.apps.ExtractBAMSubset 
    INFO  13:27:34,276 HelpFormatter - Program Args: -I /proj/yunligrp/users/minzhi/gs_script/NWD.recab.list -O /proj/yunligrp/users/minzhi/gs_test_svpreprocess/md_tempdir/headers.bam -R /proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta -L NONE 
    INFO  13:27:34,281 HelpFormatter - Executing as [email protected] on Linux 3.10.0-693.11.6.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_171-b10. 
    INFO  13:27:34,281 HelpFormatter - Date/Time: 2018/06/29 13:27:34 
    INFO  13:27:34,281 HelpFormatter - --------------------------------------------------------- 
    INFO  13:27:34,282 HelpFormatter - --------------------------------------------------------- 
    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:139)
        at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:94)
        at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:655)
        at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
        at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:376)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:202)
        at org.broadinstitute.sv.dataset.SAMFileLocation.createSamFileReader(SAMFileLocation.java:97)
        at org.broadinstitute.sv.dataset.SAMLocation.createSamFileReader(SAMLocation.java:41)
        at org.broadinstitute.sv.util.sam.SAMUtils.getMergedSAMFileHeader(SAMUtils.java:86)
        at org.broadinstitute.sv.apps.ExtractBAMSubset.run(ExtractBAMSubset.java:104)
        at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
        at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
        at org.broadinstitute.sv.apps.ExtractBAMSubset.main(ExtractBAMSubset.java:74) 
    

    I requested 4 cores and 10 GB of memory, and after I canceled the job, I found that it had used only about 5 GB. So is there anything wrong with my -tempDir flag -- could it be causing this problem? Or are the versions of the software I am using incompatible? May I have your suggestions about this problem?

    Thank you in advance.

    Best regards,
    Wusheng

  • bhandsaker Member, Broadie, Moderator admin

    The immediate problem is the -Xmx3072m (3 GB) heap given to ExtractBAMSubset. This job extracts the headers from the input bam files (to allow efficient access later), and I think the htsjdk code for bam headers is not very memory efficient. I would suggest editing the SVQScript.q file to give more memory to this job, which is called CreateMergedBamHeaders. Specifically, increase this:
    this.javaMemoryLimit = Some(3)
    And you might want to turn up memoryLimit as well, to match what you have requested on the node.

  • zhangwusheng Member

    Hi Bob,

    Thank you very much for your help! This really works. And I guess that when you say "you might want to turn up memoryLimit also", you mean

    this.javaMemoryLimit = Some(3)
    

    in the class "CreateMergedBamHeaders(bamFile: File)" in the file SVQScript.q. Is this correct? Or is there a memoryLimit flag to set in the run script for SVCNVDiscovery?

    The edited class CreateMergedBamHeaders(bamFile: File) is shown below.

        class CreateMergedBamHeaders(bamFile: File) 
            extends JavaCommand {
            this.javaMainClass = "org.broadinstitute.sv.apps.ExtractBAMSubset"
            this.inputFile = bamInputs
            this.outputFile = bamFile
            this.memoryLimit = Some(27)        // scheduler-side memory limit (GB), matching the node request
            /*this.javaMemoryLimit = Some(3)*/  // original value
            this.javaMemoryLimit = Some(27)     // raises the java heap (-Xmx) for this job
            val readGroupMappingFiles =
                metaDataLocationList
                    .map(md => new File(md, "read_groups.dat"))
                    .filter(file => file.exists)
            commandArguments +=
                optional(" -R ", referenceFile) +
                optional(" -IC ", inputFileIndexCache) +
                repeat(" -sample ", sampleList) +
                required(" -L ", "NONE") +
                repeat(" -readGroupMappingFile ", readGroupMappingFiles)
        }
    
  • bhandsaker Member, Broadie, Moderator admin