
"Unable to determine status of job" warn in Genome STRiP preprocessing

During the preprocessing step, the following kind of warning keeps occurring:

WARN  10:50:19,392 DrmaaJobRunner - Unable to determine status of job id 2231784
org.ggf.drmaa.DrmCommunicationException: unable to send message to qmaster using port 6448 on host "sge2": got send error
        at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:402)
        at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392)
        at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.getJobProgramStatus(JnaSession.java:156)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner.liftedTree1$1(DrmaaJobRunner.scala:124)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner.updateJobStatus(DrmaaJobRunner.scala:123)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:56)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:56)
        at scala.collection.immutable.Set$Set3.foreach(Set.scala:115)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobManager.updateStatus(DrmaaJobManager.scala:56)
        at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1369)
        at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1361)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.broadinstitute.gatk.queue.engine.QGraph.updateStatus(QGraph.scala:1361)
        at org.broadinstitute.gatk.queue.engine.QGraph.runJobs(QGraph.scala:548)
        at org.broadinstitute.gatk.queue.engine.QGraph.run(QGraph.scala:168)
        at org.broadinstitute.gatk.queue.QCommandLine.execute(QCommandLine.scala:170)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61)
        at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)

This eventually causes a job to fail:

INFO  18:03:26,604 QCommandLine - Script failed: 1 Pend, 0 Run, 1 Fail, 980 Done

If this occurs at the very end of the master job (as shown above: 1 Pend, 0 Run, 1 Fail, 980 Done), then I think the intermediate files have already been deleted, which forces all the jobs to be redone if I resubmit the master job.

The "Unable to determine status of job" warning occurs randomly (it is not tied to any particular job), since resubmitting the master job can get past it. But if it occurs at the very end, resubmitting the master job will redo everything.

So is there a way to prevent this warning? If not, I think the program really needs to be improved so that this kind of "redo everything from scratch" cannot happen.

Comments

  • bhandsaker Member, Broadie, Moderator

    Have you verified that rerunning the job will redo the work that was already done? For example, you can rerun without "-run" and this will do a "dry run" and tell you what jobs will be redone. Queue is generally pretty reliable about not redoing work unless somehow the .*.done files get lost.
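
    For reference, a dry run is just the usual Queue command with "-run" left off. A minimal sketch is below; the jar names, script path, and other arguments are placeholders rather than anything taken from this thread:

        # Same command that is normally submitted, minus "-run": Queue then only
        # reports which jobs are Pending/Done instead of executing anything.
        java -cp SVToolkit.jar:GenomeAnalysisTK.jar:Queue.jar \
            org.broadinstitute.gatk.queue.QCommandLine \
            -S /path/to/svtoolkit/qscript/SVPreprocess.q \
            <your usual preprocessing arguments>
            # add -run only when you want the jobs to actually execute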

  • hjzhou Member

    Yes. From job 6 to job 982, all need to be redone.
    And here are all the files in the metadata folder:

  • skashin Member

    Indeed, once the intermediate files have been deleted, rerunning the preprocessing in this directory will result in most of the jobs being rerun.
    I will make a code change to ensure that deleting intermediate files takes place at the very end of the workflow.

    For your run, I believe that only the last step that creates the file metadata/profiles_100Kb/rd.dat still needs to be run.
    The easiest way to do this would be to run the preprocessing script in dry-run mode, which outputs all the commands, then find the java command for that last step and run it directly from a console.
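
    As a rough sketch (assuming the dry-run log really does contain the full command lines as described above, and with made-up file names):

        # Save the dry-run output, then pull out the java command for the
        # profiles_100Kb/rd.dat step and paste it into a console.
        java -cp ... org.broadinstitute.gatk.queue.QCommandLine -S SVPreprocess.q <usual arguments> > dryrun.log 2>&1
        grep -B 1 -A 3 'rd.dat' dryrun.log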

  • bhandsaker Member, Broadie, Moderator
    edited March 23

    It is also possible that the rd.dat job succeeded. This output file (and the .done file) should be in the profiles_100Kb directory. If these files are there, you can just ignore the error. The rd.dat file is, in fact, not used in downstream processing, so you can also just ignore this error if that file is not there for some reason.
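
    Concretely, a quick check along these lines shows whether that last job actually completed (the metadata/profiles_100Kb path comes from this thread; the exact .done file name will depend on your run):

        ls -l  metadata/profiles_100Kb/rd.dat
        ls -la metadata/profiles_100Kb/ | grep -i done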

  • hjzhou Member

    Thank you very much. I checked the profiles_100Kb folder. There is no rd.dat file. Also, one .tbi file is missing for one chromosome. Since it was the M/Y chromosome, which does not seem essential, I continued with the downstream steps and they seem to be working OK. But I still think the script could be improved so that intermediate files are not deleted until the very end.

  • pjmtele Member
    edited June 6

    Has the fix been made to ensure that intermediate files are not deleted until the very end of the workflow?

    My SVPreprocess job failed at the final tabix stage and there was just one job pending as in the example above. When I reran the master job, it started completely from scratch. All of the data in the metadata folder subdirectories is gone.

    My version is from before this discussion, so perhaps I can just download a new version with the bug fix?

    Never mind, I see that this has been fixed in the new release.
