A fatal error has been detected by the Java Runtime Environment while running Genome Strip.

Syed India Member
edited April 2016 in GenomeSTRiP

Hello everyone,

I was trying to run Genome STRiP CNV discovery on one sample and I am getting the error below.

INFO 14:42:06,029 FunctionEdge - Starting: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/.queue/tmp' '-cp' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/SVToolkit.jar:/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/gatk/Queue.jar' '-cp' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/SVToolkit.jar:/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/gatk/Queue.jar' 'org.broadinstitute.sv.discovery.SVDepthScanner' '-O' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/testCNV/cnv_stage1/seq_9/seq_9.sites.vcf.gz' '-R' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/svtoolkit/reference_metadata_bundles/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta' '-genomeMaskFile' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/svtoolkit/reference_metadata_bundles/Homo_sapiens_assembly19/Homo_sapiens_assembly19.svmask.fasta' '-genomeMaskFile' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/svtoolkit/reference_metadata_bundles/Homo_sapiens_assembly19/Homo_sapiens_assembly19.lcmask.fasta' '-genderMapFile' 'gender_map_file.txt' '-md' 'testCNV/metadata' '-configFile' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/conf/genstrip_parameters.txt' '-L' '9' '-tilingWindowSize' '1000' '-tilingWindowOverlap' '500' '-maximumReferenceGapLength' '1000'
INFO 14:42:06,030 FunctionEdge - Output written to /gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/testCNV/cnv_stage1/seq_9/logs/CNVDiscoveryStage1-1.out
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00000037f8d32d5f, pid=26385, tid=47583501379328
#
# JRE version: Java(TM) SE Runtime Environment (8.0_66-b17) (build 1.8.0_66-b17)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.66-b17 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libc.so.6+0x132d5f]
#
# Core dump written. Default location: /gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/core or core.26385
#
# An error report file with more information is saved as:
# /gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/hs_err_pid26385.log
#

I tried both Java 1.7 and Java 1.8, with Genome STRiP 2.00.1650 and 2.00.1636. The preprocessing step works, but CNVDiscovery fails with the error above. I am using LSF and submit jobs with bsub -n 8 scriptname.sh.
I am attaching the script for your perusal as well.

Could someone please help me with this?


Answers

  • Syed India Member

    Can someone please help me with this?

  • zihhuafang Switzerland Member
    Hi everyone,

    I ran into the same error message with the CNVDiscovery pipeline using Genome STRiP release 2.00.1918.

    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x00002b8f92cc9015, pid=17042, tid=0x00002b8f92433700
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 )
    # Problematic frame:
    # C [libc.so.6+0x16f015] __strlen_sse2_pminub+0x35
    #
    # Core dump written. Default location: /cluster/home/fangzi/script/genomestrip/core or core.17042
    #
    # An error report file with more information is saved as:
    # /cluster/home/fangzi/script/genomestrip/hs_err_pid17042.log

    I have 358 WGS bovine samples with coverage ranging from 5x to 40x, and I requested 100G to run the CNVDiscovery pipeline on LSF. I have successfully completed preprocessing, deletion discovery and genotyping, so I don't know what could be causing this failure.

    Here's my script for running the pipeline :
    module load gcc/4.8.2 java/1.8.0_101 samtools/1.8

    SV_TMPDIR='/cluster/work/pausch/temp_scratch/fang/SV_tmp_cnv'
    export SV_DIR='/cluster/work/pausch/fang/svtoolkit'
    inputFile='/cluster/work/pausch/fang/svtoolkit/bam.list'
    runDir='/cluster/work/pausch/fang/svtoolkit/ucd_sv/ucd_test'
    reference_prefix='/cluster/work/pausch/fang/svtoolkit/reference_meta/ARS-UCD1.2_Btau5.0.1Y'
    outdir='/cluster/work/pausch/fang/svtoolkit/ucd_sv/ucd_test/output/cnv'
    output_prefix='test'
    gendermap='/cluster/work/pausch/fang/svtoolkit/ucd_sv/ucd_test/gendermap.txt'

    export PATH=${SV_DIR}/bwa:${PATH}
    export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}

    mkdir -p ${SV_TMPDIR}
    mkdir -p ${runDir}/logs_cnv || exit 1

    mx="-Xmx100g"
    classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"

    echo "-- Run CNVPipeline -- "
    LC_ALL=C java -cp ${classpath} ${mx} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -cp ${classpath} \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
    -memLimit 100.0 \
    --disableJobReport \
    -R ${reference_prefix}.fa \
    -genomeMaskFile ${reference_prefix}.svmask.fasta \
    -genderMapFile ${gendermap} \
    -ploidyMapFile ${reference_prefix}.ploidymap.txt \
    -md /cluster/work/pausch/fang/svtoolkit/ucd_sv/ucd_test/metadata.list \
    -runDirectory ${runDir} \
    -jobLogDir ${runDir}/logs_cnv \
    -I ${inputFile} \
    -intervalList /cluster/work/pausch/fang/svtoolkit/chr.list \
    -tilingWindowSize 5000 \
    -tilingWindowOverlap 2500 \
    -maximumReferenceGapLength 2500 \
    -boundaryPrecision 200 \
    -minimumRefinedLength 2500 \
    -run \
    || exit 1

    Does anyone have an idea how to solve this issue?
    Thank you in advance.

    Best wishes,
    Zih-Hua
  • bhandsaker Member, Broadie ✭✭✭✭

    This kind of error is generally not a problem with the Genome STRiP code per se, but usually some kind of environmental issue.

    Do your cluster nodes really all have 100G available? I don't see you making any memory reservations on the LSF cluster itself. You could try with less memory to see if memory is in fact the problem. You should be able to run with quite a bit less. Also, providing the complete log output might be helpful.

  • zihhuafang Switzerland Member
    Dear Bob,
    Thank you very much for the reply.
    I did request 100G as I submitted my batch job with this command:
    bsub -n 20 -R "rusage[mem=5000]" -W 120:00 -J cnv -o cnv_100g.out < CNVDiscovery.sh

    I initially started the job with 2G and received the same error message.
    I discussed with our cluster support team, and they advised me to increase the memory.
    I tested 4G, 50G and 100G, but I always got the same error message.
    Our support team advised me to contact the developer as they can't figure out what would be the potential issue.

    Please find attached the logs from the runs with 50G and 100G.
    Thank you in advance.
  • bhandsaker Member, Broadie ✭✭✭✭

    The pipeline will run additional jobs using the DRMAA API.

    I think the problem is that you need to use -jobNative arguments to specify additional memory for jobs that get launched by the pipeline. From the logs, I can see that the jobs are using, e.g. -Xmx51200m, but I'm not sure what your default memory limits are on your cluster. You are submitting the top-level job with -R "rusage[mem=5000]". Is that 5G? What is the default LSF memory unit on your cluster?

    I would try adding a -jobNative argument, something like -jobNative "-R rusage[mem=5000]" or whatever number is appropriate. I would start by setting -memLimit to 5, which will probably be enough, and adding a jobNative parameter to request 8G (the number to use depends on your default memory unit, e.g. if it is MB, then -jobNative "-R rusage[mem=8000]"). At least some jobs should run fine this way; if you do need more memory, it will be for some jobs in later stages.
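
A minimal sketch of how the two memory settings suggested above could be assembled before adding them to the QCommandLine call. Only -memLimit and -jobNative come from the thread; the helper name queue_mem_args is hypothetical, and the MB unit for rusage is an assumption that depends on the cluster configuration.

```shell
# Print the memory-related Queue arguments, one per line, so the quoting
# of the -jobNative value can be checked before it is used.
# $1: value for -memLimit (the thread uses it as gigabytes per dispatched job)
# $2: value for rusage[mem=...] (assumed here to be MB)
queue_mem_args() {
  printf '%s\n' \
    -memLimit "$1" \
    -jobNative "-R rusage[mem=$2]"
}

# 5G of heap per job, with an 8000 MB (about 8G) scheduler reservation.
queue_mem_args 5 8000
```

The -jobNative value has to reach Queue as a single argument, which is why it is quoted as one string containing the space.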

  • zihhuafang Switzerland Member
    edited November 4
    The default LSF memory unit on our cluster is MB, so -R "rusage[mem=5000]" is indeed 5G. For each processor, I could request up to 64G.

    I have modified my script according to your suggestion and re-ran the pipeline. However, I still received the same error message for the core dump (see attached logs). Should I increase the memory?
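
With MB as the unit, the reservation made by the earlier top-level submission (bsub -n 20 -R "rusage[mem=5000]") can be checked with a little arithmetic. This sketch assumes the rusage value is applied per slot, which is a site configuration choice and not something the thread confirms.

```shell
# Total memory reserved by: bsub -n 20 -R "rusage[mem=5000]"
# Assumption: rusage[mem=...] is in MB and is applied per slot.
slots=20
mem_per_slot_mb=5000
total_mb=$(( slots * mem_per_slot_mb ))
echo "reserved by LSF: ${total_mb} MB"

# The top-level JVM in the script is started with -Xmx100g (1g = 1024m).
heap_mb=$(( 100 * 1024 ))
echo "JVM max heap:    ${heap_mb} MB"
```

Under these assumptions the heap ceiling is slightly above the reservation, and a JVM also uses memory outside the heap, so leaving some headroom between -Xmx and the scheduler reservation is the safer choice.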
  • bhandsaker Member, Broadie ✭✭✭✭

    Looking in a bit more detail at the files you sent, it seems like the problem is happening when Queue tries to submit a job in one of the worker processes.
    The CNV pipeline does a double-level of job dispatching (i.e. it dispatches a set of jobs that will in turn dispatch additional jobs).

    Two thoughts:
    a) Are your compute hosts also LSF submission hosts?
    b) What version of LSF are you running?

    For debugging, you can also try a reduced invocation using "-lastStage 1" and "-intervalList 1" (where it looks like "1" is the name of chr1 in your reference, you could pick another chromosome as well). You should do the debugging off to the side so as not to interfere with the eventual run. I suspect this will recreate the problem as well.
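
One way to set up the reduced debugging run described above, kept apart from the real run directory. The -lastStage, -intervalList, -runDirectory and -jobLogDir options are the ones named in the thread; the helper name debug_args and the side directory path are hypothetical.

```shell
# Side directory so the debug run does not touch the real run directory.
# The location is a placeholder; any scratch path would do.
debug_run_dir="${TMPDIR:-/tmp}/cnv_debug"
mkdir -p "${debug_run_dir}/logs"

# Print the arguments that differ from the full invocation.
# $1: chromosome name exactly as it appears in the reference (for example 1)
debug_args() {
  printf '%s\n' \
    -runDirectory "${debug_run_dir}" \
    -jobLogDir "${debug_run_dir}/logs" \
    -intervalList "$1" \
    -lastStage 1
}

debug_args 1
```

These values would replace the corresponding options in the full QCommandLine call; everything else in the invocation stays the same.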

  • zihhuafang Switzerland Member
    edited November 4
    I ran the reduced invocation as you suggested. Indeed, it recreates the same problem.

    RE:
    a) the submission hosts and executing hosts are different.
    b) We have IBM Spectrum LSF Standard 10.1.0.6, May 25 2018. Could the version of LSF cause this issue?
  • bhandsaker Member, Broadie ✭✭✭✭

    I'm worried that you are misunderstanding point (a).
    On your cluster, if you submit a job to an execution host, can the submitted job run bsub?
    In other words, are your execution hosts configured to be allowed to be submission hosts?

    You should try this with a toy example (i.e. submit a job that then runs bsub). If your cluster has some nodes configured as submission hosts, it may be possible to set some flags to force your jobs to run on allowed submission hosts.

    While it's good to know the LSF version, I doubt this is the root of the problem, since you can successfully submit jobs, you ran the deletion pipeline, etc.
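
The toy example mentioned above could look roughly like this: an outer job whose only task is to run bsub again on the execution host. The job names and output file names are arbitrary placeholders, and the fallback branch exists only so the sketch can be read or run on a machine without an LSF client.

```shell
# Command the outer job will run on the execution host.
inner_cmd='bsub -J nested_inner -o nested_inner.out hostname'

submit_nested() {
  if command -v bsub >/dev/null 2>&1; then
    # Outer job: runs on an execution host and tries to submit the inner job.
    bsub -J nested_outer -o nested_outer.out "${inner_cmd}"
  else
    # No LSF client here: only show what would be submitted.
    echo "bsub not found; would run: bsub -J nested_outer -o nested_outer.out '${inner_cmd}'"
  fi
}

submit_nested
```

If nested submission is allowed, nested_outer.out should show bsub accepting the inner job and nested_inner.out should appear once that job has run; an error in nested_outer.out would point at the execution host configuration.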

  • zihhuafang Switzerland Member
    Sorry for the misunderstanding. Now I understand what you mean.

    Yes, our execution hosts are allowed to be submission hosts.
    I have run other software that requires job submission from compute nodes without any issue.
    That's why our cluster support team and I could not understand what could be the potential environmental issue.
  • bhandsaker Member, Broadie ✭✭✭✭

    I'm attaching two scripts that may help you debug. I had to tar them so they could be attached.
    I am able to run them here, but you will have to modify them for your environment.
    We use SGE/Uger as our cluster management software, so our configuration is different.

    The first script, test_cnv_pipeline.sh, runs a small test case using data from the GS installation. You need to copy the GS installation into the local directory (as ./svtoolkit). It will run the installtest to create metadata, then use this to do a toy invocation of the CNV pipeline (just stage1). I think this will fail for you in the same way that the full CNV pipeline is failing. The test script runs 3 small jobs, then runs a Queue job that will do a nested job dispatch, which is the step that seems to be failing for you.

    The second script, queue_uger_wrapper.sh, is a wrapper script that we need to use in our SGE environment. You can create a similar wrapper script for your environment and use it to help debug.
    This wrapper script wraps all of the cluster job submissions (on the execution host).
    In our environment, I need to reset LD_LIBRARY_PATH (which is otherwise scrubbed) and I also need to run "use UGER" to allow me to do nested job submissions. Your situation may be similar, in that you may need to do something in the execution host environment to enable successful job submission through the DRMAA API. You can also put print statements in the wrapper script, for example to print out information about the library path, and compare it with the library path for the outer job invocation to try to determine what is different in the execution environment.

    Note that I have to use -jobRunner Uger for our SGE installation. You will need to remove this or change it to -jobRunner Lsf706. You could also try -jobRunner Drmaa if you have a DRMAA API installed.

    The proximal failure in your environment is in the initialization of the Lsf706JobRunner class. From the stack trace you sent:
    j org.broadinstitute.gatk.queue.engine.lsf.Lsf706JobRunner$.<init>()V+81
    j org.broadinstitute.gatk.queue.engine.lsf.Lsf706JobRunner$.<clinit>()V+3
    If you want to see the source code, it is here:
    https://github.com/broadinstitute/genomestrip-gatk/blob/master/public/gatk-queue/src/main/scala/org/broadinstitute/gatk/queue/engine/lsf/Lsf706JobRunner.scala
    or if that doesn't work I believe it is unchanged from this version:
    https://github.com/broadgsa/gatk/blob/master/public/gatk-queue/src/main/scala/org/broadinstitute/gatk/queue/engine/lsf/Lsf706JobRunner.scala
    I suspect it is some library difference on the execution host or perhaps an inability to successfully read some of the parameters that get initialized from the LSF configuration when the Lsf706JobRunner is instantiated.

    Let me know what you are able to learn.
    Bob
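
To pull the same information out of an hs_err report, something like the following may help. It is only a sketch: the function name crash_summary is hypothetical, and it relies on the report containing a "Problematic frame" header and Java frames that mention Lsf706JobRunner, as in the excerpts quoted in this thread.

```shell
# Show the native frame that crashed and any Java frames from the LSF job runner.
# $1: path to an hs_err_pid<N>.log file
crash_summary() {
  grep -A 1 'Problematic frame' "$1"
  grep 'Lsf706JobRunner' "$1"
}

# Summarize every report in the current directory, if there are any.
for f in hs_err_pid*.log; do
  [ -e "$f" ] || continue
  echo "== $f"
  crash_summary "$f"
done
```

A report that shows a libc frame together with Lsf706JobRunner initialization frames would match the failure described above; a report without those frames probably has a different cause.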

  • zihhuafang Switzerland Member
    Dear Bob,

    Thank you very much for the test scripts and for your help.

    The problem was solved by changing -jobRunner from "Lsf706" to "Shell" with the wrapper script. I am not sure if this is the best practice, but I could successfully run the test pipeline. With -jobRunner as Lsf706, I did not have the output such as CNVDiscoveryPipeline-1.out.

    I am also waiting for our cluster support to install the DRMAA API. This could potentially solve the issue as well.

    Would you suggest other ways to proceed instead of using Shell as jobRunner?

    Best wishes,
    Zih-Hua
  • bhandsaker Member, Broadie ✭✭✭✭

    Glad you were able to make some progress.

    Running with -jobRunner Shell will work for small cohorts, but you are running single threaded with no parallelism which will be impractical for larger cohorts.

    One alternative is to use -jobRunner ParallelShell in combination with the -maxConcurrentRun N option. This will start up to N sub-shells running in parallel. If you run on a large node with multiple cores and sufficient memory (say 4G or so per parallel job) then you can get some parallelism that way, but of course not as much as if you were able to run hundreds of jobs in parallel.

    Also, anecdotally, we have seen a higher rate of transient failures with -jobRunner Shell compared to -jobRunner ParallelShell. The failures often manifest as jobs that apparently completed successfully, but Queue thinks they failed. If you rerun, they usually run fine. If you experience this, using the ParallelShell approach might be more robust, even if you use -maxConcurrentRun 1.
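
A rough way to choose N for -maxConcurrentRun on a single node, following the "4G or so per parallel job" guideline above. The helper name max_concurrent and the example node size (20 cores, 64G) are hypothetical.

```shell
# Largest number of parallel jobs that fits both the core count and the memory.
# $1: cores on the node, $2: memory on the node in GB,
# $3: memory budget per job in GB (defaults to 4)
max_concurrent() {
  cores=$1
  mem_gb=$2
  per_job_gb=${3:-4}
  by_mem=$(( mem_gb / per_job_gb ))
  if [ "$by_mem" -lt "$cores" ]; then n=$by_mem; else n=$cores; fi
  # Always allow at least one job.
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
}

n=$(max_concurrent 20 64)
echo "-jobRunner ParallelShell -maxConcurrentRun ${n}"
```

For this example node, memory is the limiting factor, so the sketch settles on 16 parallel jobs even though 20 cores are available.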

  • bhandsaker Member, Broadie ✭✭✭✭

    Also, if you want to get LSF working, I would focus on getting the toy example to run by putting code in the wrapper script to print out a lot of things about the environment. Specifically, I would look for whether the lsf binaries and shared libraries (including dependencies), environment variables used by LSF, etc., are all identical. As I mentioned, here using SGE we need to force the correct libraries on to our path in the wrapper script.
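
The environment comparison suggested above might be sketched like this. The function name dump_env and the snapshot file names are hypothetical. The sketch assumes the LSF settings of interest are exported as variables starting with LSF_ (for example LSF_LIBDIR), which is typical but should be checked on the cluster.

```shell
# Write a snapshot of the settings that nested job submission depends on.
# Call it once in the outer job and once from the wrapper script on the
# execution host, then diff the two files.
# $1: output file
dump_env() {
  {
    echo "host=$(uname -n)"
    echo "PATH=${PATH}"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}"
    echo "bsub=$(command -v bsub || echo 'not found')"
    # All LSF-related variables, sorted so that two snapshots line up in diff.
    env | grep '^LSF_' | sort
    # Shared libraries shipped with LSF, if the library directory is known.
    if [ -n "${LSF_LIBDIR:-}" ] && [ -d "${LSF_LIBDIR}" ]; then
      ls "${LSF_LIBDIR}"
    fi
  } > "$1"
}

snapshot="${TMPDIR:-/tmp}/env.$(uname -n).$$.txt"
dump_env "${snapshot}"
echo "wrote ${snapshot}"

# With one snapshot from each side, for example:
#   diff env.outer.txt env.inner.txt
```

Any line that differs between the two snapshots, especially LD_LIBRARY_PATH or a missing LSF_ variable, is a candidate for what needs to be restored in the wrapper script before the nested submission.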
