
Job is done but Cromwell thinks it is still running?

esalinas Broad Member, Broadie ✭✭✭
edited February 2017 in Ask the FireCloud Team

In the UI I see that a job that started last night (around 5:19 PM) is apparently still running this morning (~9:00 AM):

Call #15:
ID:operations/EJbDvOimKxiPoePtu4iSrLUBIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:
Running
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:Pending...
Inputs:Show
Outputs:None
stdout:Mutect1_Task-14-stdout.log
stderr:Mutect1_Task-14-stderr.log
JES log:Mutect1_Task-14.log

But when I give the operation ID to Google, I see that it is not running: it finished last night about 30 minutes after it started, and the last event is "ok":

wm8b1-75c:workflows esalinas$ gcloud alpha genomics operations describe  operations/EJbDvOimKxiPoePtu4iSrLUBIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
done: true
metadata:
  '@type': type.googleapis.com/google.genomics.v1.OperationMetadata
  clientId: ''
  createTime: '2017-02-23T22:19:52Z'
  endTime: '2017-02-23T22:41:00Z'
  events:
  - description: start
    startTime: '2017-02-23T22:21:23.013200636Z'
  - description: pulling-image
    startTime: '2017-02-23T22:24:53.220856254Z'
  - description: localizing-files
    startTime: '2017-02-23T22:26:04.698573946Z'
  - description: running-docker
    startTime: '2017-02-23T22:33:37.684133510Z'
  - description: delocalizing-files
    startTime: '2017-02-23T22:40:46.430394214Z'
  - description: copied 1 file(s) to "gs://fc-6e92fceb-ef44-4648-a5b0-71cacc0248eb/52af88fa-61f0-4d7a-9f07-2daf8880f65e/CallingGroup_Workflow/a2f28101-f791-4477-b50c-78eee3eba380/call-Mutect1_Task/shard-14/MuTect1.call_stats.txt"
    startTime: '2017-02-23T22:40:48.214386407Z'
  - description: copied 1 file(s) to "gs://fc-6e92fceb-ef44-4648-a5b0-71cacc0248eb/52af88fa-61f0-4d7a-9f07-2daf8880f65e/CallingGroup_Workflow/a2f28101-f791-4477-b50c-78eee3eba380/call-Mutect1_Task/shard-14/Mutect1_Task-14-rc.txt"
    startTime: '2017-02-23T22:40:49.654216992Z'
  - description: copied 1 file(s) to "gs://fc-6e92fceb-ef44-4648-a5b0-71cacc0248eb/52af88fa-61f0-4d7a-9f07-2daf8880f65e/CallingGroup_Workflow/a2f28101-f791-4477-b50c-78eee3eba380/call-Mutect1_Task/shard-14/MuTect1.power.wig.txt"
    startTime: '2017-02-23T22:40:51.580064108Z'
  - description: copied 1 file(s) to "gs://fc-6e92fceb-ef44-4648-a5b0-71cacc0248eb/52af88fa-61f0-4d7a-9f07-2daf8880f65e/CallingGroup_Workflow/a2f28101-f791-4477-b50c-78eee3eba380/call-Mutect1_Task/shard-14/dstat.log.txt"
    startTime: '2017-02-23T22:40:52.854040039Z'
  - description: copied 1 file(s) to "gs://fc-6e92fceb-ef44-4648-a5b0-71cacc0248eb/52af88fa-61f0-4d7a-9f07-2daf8880f65e/CallingGroup_Workflow/a2f28101-f791-4477-b50c-78eee3eba380/call-Mutect1_Task/shard-14/MuTect1.coverage.wig.txt"
    startTime: '2017-02-23T22:40:57.161349237Z'
  - description: copied 1 file(s) to "gs://fc-6e92fceb-ef44-4648-a5b0-71cacc0248eb/52af88fa-61f0-4d7a-9f07-2daf8880f65e/CallingGroup_Workflow/a2f28101-f791-4477-b50c-78eee3eba380/call-Mutect1_Task/shard-14/df.log.txt"
    startTime: '2017-02-23T22:40:58.624717699Z'
  - description: ok
    startTime: '2017-02-23T22:40:59.925851688Z'
  labels: {}

Thus, there seems to be a job status inconsistency.
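
As a quick way to spot-check several calls at once, the same describe command can be filtered down to the fields that matter here. This is just a sketch: it assumes the operation IDs have been collected by hand into a file oids.txt, one operations/... ID per line.

# report Google's view (done flag, end time, last event) for each operation ID in oids.txt
while read -r oid; do
  echo "== $oid"
  desc=$(gcloud alpha genomics operations describe "$oid")
  echo "$desc" | grep -E '^done:|endTime:'
  echo "$desc" | grep 'description:' | tail -n 1   # last event, e.g. "ok"
done < oids.txt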

I want to mention that no task in the workflow is marked as failed. Moreover, the task marked "Running" is preventing a gather step from starting. Below is a copy/paste of the status from the UI.

Workflow ID:a2f28101-f791-4477-b50c-78eee3eba380
Status:
Running
Submitted:February 23, 2017, 5:10 PM (15 hours ago)
Started:February 23, 2017, 5:10 PM (15 hours ago)
Inputs:Show
Outputs:None
Workflow Log:workflow.a2f28101-f791-4477-b50c-78eee3eba380.log
Workflow Timing:Show
Calls:
CallingGroup_Workflow.MutectFC_Task
Call #1:
ID:operations/EJnDvOimKxjZ05HIwsrxzw0g3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:50 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:MutectFC_Task-stdout.log
stderr:MutectFC_Task-stderr.log
JES log:MutectFC_Task.log
CallingGroup_Workflow.Mutect1_Task
Call #1:
ID:operations/ELPEvOimKxiLnZHm5q7L42Ig3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-0-stdout.log
stderr:Mutect1_Task-0-stderr.log
JES log:Mutect1_Task-0.log
Call #2:
ID:operations/EMnDvOimKxjco8eCtreuoS0g3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-1-stdout.log
stderr:Mutect1_Task-1-stderr.log
JES log:Mutect1_Task-1.log
Call #3:
ID:operations/ENbDvOimKxignPHluL6G63kg3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-2-stdout.log
stderr:Mutect1_Task-2-stderr.log
JES log:Mutect1_Task-2.log
Call #4:
ID:operations/ENDDvOimKxizytWQ-eyshocBIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-3-stdout.log
stderr:Mutect1_Task-3-stderr.log
JES log:Mutect1_Task-3.log
Call #5:
ID:operations/ELTDvOimKxiz6JjZtdmmjWYg3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-4-stdout.log
stderr:Mutect1_Task-4-stderr.log
JES log:Mutect1_Task-4.log
Call #6:
ID:operations/ENjDvOimKxjV3bq1jJ_6giwg3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-5-stdout.log
stderr:Mutect1_Task-5-stderr.log
JES log:Mutect1_Task-5.log
Call #7:
ID:operations/EPXDvOimKxjwsdj28ZOLxqUBIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:36 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-6-stdout.log
stderr:Mutect1_Task-6-stderr.log
JES log:Mutect1_Task-6.log
Call #8:
ID:operations/EOvDvOimKxj1ysW3voqF7I4BIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-7-stdout.log
stderr:Mutect1_Task-7-stderr.log
JES log:Mutect1_Task-7.log
Call #9:
ID:operations/ELXDvOimKxjfu-ve9NedyUYg3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-8-stdout.log
stderr:Mutect1_Task-8-stderr.log
JES log:Mutect1_Task-8.log
Call #10:
ID:operations/EIzEvOimKxjP6b6GtJ_Y1dEBIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-9-stdout.log
stderr:Mutect1_Task-9-stderr.log
JES log:Mutect1_Task-9.log
Call #11:
ID:operations/EKvEvOimKxiS5vauk66Ejyog3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-10-stdout.log
stderr:Mutect1_Task-10-stderr.log
JES log:Mutect1_Task-10.log
Call #12:
ID:operations/ENnDvOimKxiiqLWQgqr_87oBIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-11-stdout.log
stderr:Mutect1_Task-11-stderr.log
JES log:Mutect1_Task-11.log
Call #13:
ID:operations/EOjDvOimKxiqqZ7jv7Sjqkog3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-12-stdout.log
stderr:Mutect1_Task-12-stderr.log
JES log:Mutect1_Task-12.log
Call #14:
ID:operations/EK_DvOimKxiogKnXn4q43QMg3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-13-stdout.log
stderr:Mutect1_Task-13-stderr.log
JES log:Mutect1_Task-13.log
Call #15:
ID:operations/EJbDvOimKxiPoePtu4iSrLUBIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:
Running
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:Pending...
Inputs:Show
Outputs:None
stdout:Mutect1_Task-14-stdout.log
stderr:Mutect1_Task-14-stderr.log
JES log:Mutect1_Task-14.log
Call #16:
ID:operations/EOrDvOimKxjdq4jgxv6YnJ0BIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-15-stdout.log
stderr:Mutect1_Task-15-stderr.log
JES log:Mutect1_Task-15.log
Call #17:
ID:operations/EMXDvOimKxiZiufezYq40K8BIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-16-stdout.log
stderr:Mutect1_Task-16-stderr.log
JES log:Mutect1_Task-16.log
Call #18:
ID:operations/ENXDvOimKxiho8Pv1LfZhJoBIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-17-stdout.log
stderr:Mutect1_Task-17-stderr.log
JES log:Mutect1_Task-17.log
Call #19:
ID:operations/EJ7DvOimKxiama3VyPemvb4BIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:50 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-18-stdout.log
stderr:Mutect1_Task-18-stderr.log
JES log:Mutect1_Task-18.log
Call #20:
ID:operations/EILEvOimKxiRwM3ime6Ti-ABIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-19-stdout.log
stderr:Mutect1_Task-19-stderr.log
JES log:Mutect1_Task-19.log
Call #21:
ID:operations/EN_DvOimKxiFlu3gq6TStO4BIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-20-stdout.log
stderr:Mutect1_Task-20-stderr.log
JES log:Mutect1_Task-20.log
Call #22:
ID:operations/ELzDvOimKxjbxfbbn_-dmacBIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-21-stdout.log
stderr:Mutect1_Task-21-stderr.log
JES log:Mutect1_Task-21.log
Call #23:
ID:operations/EJHEvOimKxjbwfa-pe-2uCIg3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:43 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-22-stdout.log
stderr:Mutect1_Task-22-stderr.log
JES log:Mutect1_Task-22.log
Call #24:
ID:operations/EJjDvOimKxjcpdS9ub66xk8g3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:50 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-23-stdout.log
stderr:Mutect1_Task-23-stderr.log
JES log:Mutect1_Task-23.log
Call #25:
ID:operations/EKnDvOimKxii59Gtgo3ex9cBIN6K28SDHCoPcHJvZHVjdGlvblF1ZXVl
Status:Done
Started:February 23, 2017, 5:19 PM (15 hours ago)
Ended:February 23, 2017, 5:50 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:Mutect1_Task-24-stdout.log
stderr:Mutect1_Task-24-stderr.log
JES log:Mutect1_Task-24.log
CallingGroup_Workflow.CallSomaticMutations_131_Prepare_Task
Call #1:
ID:operations/EPn-meimKxjl3rfSku6d71og3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
Status:Done
Started:February 23, 2017, 5:10 PM (15 hours ago)
Ended:February 23, 2017, 5:19 PM (15 hours ago)
Inputs:Show
Outputs:Show
stdout:CallSomaticMutations_131_Prepare_Task-stdout.log
stderr:CallSomaticMutations_131_Prepare_Task-stderr.log
JES log:CallSomaticMutations_131_Prepare_Task.log

Answers

  • birger Member, Broadie, CGA-mod ✭✭✭

    This was a common problem before the upgrade to Cromwell 24 (Cromwell losing track of JES job status). The upgrade addressed it to some degree, but apparently the issue still arises, hopefully less frequently. Regardless, it needs to be fixed.

  • birger Member, Broadie, CGA-mod ✭✭✭

    @Geraldine_VdAuwera @abaumann

    All of a sudden this has become a big problem again: jobs completing but Cromwell not recognizing that they completed, or only recognizing it hours after the fact. @mnoble's team has been running into this, as has Eddie. What's going on? Was there a recent change in the system, either in Cromwell or JES, that is causing these issues to reappear?

  • esalinas Broad Member, Broadie ✭✭✭
    edited February 2017

    Here is a "hello world" example:

    This "hello world" submission 04bc11d0-467a-42f8-bdc0-d5a688c42b4e is a single-task WDL with NO downstream tasks.
    In the UI it took over an hour and a half to "run", whereas Google reports it took less than 2 minutes (gcloud output and a quick elapsed-time check below).

    CallingGroupWorkflow.CallSomaticMutations_131_Prepare
    Call #1:
    ID:operations/ENra64-nKxjA7przxNPioG8g3orbxIMcKg9wcm9kdWN0aW9uUXVldWU
    Status:Done
    Started:February 24, 2017, 4:15 PM (15 hours ago)
    Ended:February 24, 2017, 5:33 PM (14 hours ago)
    Inputs:None
    Outputs:None
    stdout:CallSomaticMutations_131_Prepare-stdout.log
    stderr:CallSomaticMutations_131_Prepare-stderr.log
    JES log:CallSomaticMutations_131_Prepare.log
    
    [email protected]:~/benchmarking$ gcloud alpha genomics operations  describe  operations/ENra64-nKxjA7przxNPioG8g3orbxIMcKg9wcm9kdWN0aW9uUXVldWU|head
    done: true
    metadata:
      '@type': type.googleapis.com/google.genomics.v1.OperationMetadata
      clientId: ''
      createTime: '2017-02-24T21:15:54Z'
      endTime: '2017-02-24T21:17:19Z'
      events:
      - description: start
        startTime: '2017-02-24T21:16:31.960000515Z'
      - description: pulling-image
    [email protected]:~/benchmarking$ 
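
    As a rough check of the real runtime (a sketch only: it assumes GNU date, e.g. gdate from coreutils on a Mac, and the full operation ID from above pasted in), the two timestamps can be subtracted directly:

    OID="operations/ENra64-..."   # paste the full operation ID from above
    desc=$(gcloud alpha genomics operations describe "$OID")
    created=$(echo "$desc" | grep createTime | sed "s/.*'\(.*\)'.*/\1/")
    ended=$(echo "$desc" | grep endTime | sed "s/.*'\(.*\)'.*/\1/")
    # GNU date converts the ISO timestamps to epoch seconds
    echo "Google-reported runtime: $(( $(date -d "$ended" +%s) - $(date -d "$created" +%s) )) seconds"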
    
  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    @birger @esalinas @mnoble We're going to look into this asap.

  • Chip 415M 4053 Member, Broadie

    Over the past week I've launched roughly 1500 jobs from the workspaces rebc-oct16/rebc_template and broad-firecloud-pcawg/damage. A few percent of the jobs get into this state of being done but marked as "Running" by FC, sometimes for hours and sometimes for days. An example of such a job is https://portal.firecloud.org/#workspaces/rebc-oct16:rebc_template/Monitor/eebab2e6-b4b9-4f55-a67e-362d2fd86780
    (workspace rebc-oct16/rebc_template, workflow wgs_pip_m2, pair THCA-EM-A3FQ-TP-NB in pair_set THCA_pairs_21Oct2016), which appears to have completed on 25 Feb 2017 around 6 AM but is still marked "Running" in FC even now (27 Feb 2017, 9 AM). After a day of waiting I restarted the job yesterday (upcoming deadline) but didn't abort the running/forgotten job.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi all, we've determined that this is due to a Cromwell issue. The dev team has a fix that should be in Cromwell 25, due to be released this week. I'll try to find out the ballpark ETA for a FireCloud release that would include Cromwell 25.

  • gordon123 Broad Member, Broadie
    edited February 2017

    Possibly related... this job completed in two hours, sat there for two days without FC noticing, then appears to have restarted (writing into the same location in the bucket as the previous completed run of the workflow). This seems like a stutter.

    Workflow ID:0431490f-62e4-44c3-b72f-6bb8ed8f75dd
    Status:
    Running
    Submitted:February 25, 2017, 3:27 AM (3 days ago)
    Started:February 27, 2017, 2:22 PM (55 minutes ago)
    Inputs:Show
    Outputs:None
    Workflow Log:workflow.0431490f-62e4-44c3-b72f-6bb8ed8f75dd.log
    Workflow Timing:Show
    Calls:
    pcawg_full_workflow.pcawg_full
    Call #1:
    ID:operations/EM32iKOnKxjX94rHkfn6nOEBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl
    Status:
    Preempted
    Started:February 27, 2017, 2:22 PM (55 minutes ago)
    Ended:February 27, 2017, 2:22 PM (55 minutes ago)
    Inputs:Show
    Outputs:None
    stdout:pcawg_full-stdout.log
    stderr:pcawg_full-stderr.log
    JES log:pcawg_full.log
    Call #2:
    ID:operations/EODsmYioKxiFl8qU2tLLyzUg_eDe0bUCKg9wcm9kdWN0aW9uUXVldWU
    Status:
    Running
    Started:February 27, 2017, 2:22 PM (55 minutes ago)
    Ended:Pending...
    Inputs:Show
    Outputs:None
    stdout:pcawg_full-stdout.log
    stderr:pcawg_full-stderr.log
    JES log:pcawg_full.log

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    @gordon123 The team released an update to FC around the time your job restarted. I suspect that because your job's status had been lost, the update triggered a status refresh and relaunch of that particular job. This is pure speculation on my part but I think it would explain what you're seeing.

  • Chip 415M 4053 Member, Broadie

    Is the long delay between a job finishing and FireCloud knowing the job finished expected to continue happening? Workspace broad-firecloud-pcawg/damage, workflow Mutect2_with_OB_Filters, pair_set LUAD has 516 non-failed jobs (7 failed jobs were missing a BAM and failed right away). Four of the 516 are marked "Succeeded". The other 512 are marked "Running", but all the ones I've checked say they completed several days ago. For example, submission ID 9b7b22ce-f928-46ce-a911-0c675332b145, pair LUAD-S2-AA1A-TP-NB, succeeded on March 11, 2017, 8:34 AM, but is among the 512 marked as running in the workflow view.

    Not to be pessimistic, but could this be related to doppelgangers?

    @Geraldine_VdAuwera @birger @eddieasalinas @gordon123

  • gordon123 Broad Member, Broadie
    edited March 2017

    I don't think we've seen doppelgangers when preemptible is set to 0; in those cases the Cromwell restart appears to kill the old job.

    We likely did see a delay with the earlier REBC job that was killed before any files were localized; I don't know how long it sat in the done-but-not-delocalized state, but there was likely some window that the Cromwell restart managed to hit.

  • esalinas Broad Member, Broadie ✭✭✭

    @gordon123

    The following operation IDs correspond to a duplicate (doppelganger) run without preemptible:

    operations/EMS24NaqKxjcwfPP0-vIjdIBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl
    operations/EJKZ9sqqKxis_KCdh4iNk9wBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl

    They are from submission/workflow/ID: 542c5a4c-d71b-4939-bbdc-c85abee6e8d9/FilterWorkflow/8ab869f4-2e7f-4890-b781-e483e89b18dc

    I can tell they're duplicates because they point to the same exec.sh in the bucket:

    wm8b1-75c:blat_dup_not_preemptible esalinas$ cat oids.txt |xargs -n 1 -tI {} gcloud alpha genomics operations describe {} |ggrep -Pi 'exec'|ggrep -Pi 'gs://'|uniq -c
    gcloud alpha genomics operations describe operations/EMS24NaqKxjcwfPP0-vIjdIBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl
    gcloud alpha genomics operations describe operations/EJKZ9sqqKxis_KCdh4iNk9wBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl
       2         exec: gs://fc-9c84e685-79f8-4d84-9e52-640943257a9b/542c5a4c-d71b-4939-bbdc-c85abee6e8d9/FilterWorkflow/8ab869f4-2e7f-4890-b781-e483e89b18dc/call-blat/exec.sh
    

    I can tell they're not preemptible because both operations report preemptible: false:

    wm8b1-75c:blat_dup_not_preemptible esalinas$ cat oids.txt |xargs -n 1 -tI {} gcloud alpha genomics operations describe {} |ggrep -Pi 'preem'
    gcloud alpha genomics operations describe operations/EMS24NaqKxjcwfPP0-vIjdIBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl
            preemptible: false
    gcloud alpha genomics operations describe operations/EJKZ9sqqKxis_KCdh4iNk9wBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl
            preemptible: false
    wm8b1-75c:blat_dup_not_preemptible esalinas$ 
    
    
  • gordon123 Broad Member, Broadie

    OK, this looks like a valid counterexample. And, looking at the date stamps of the operation IDs, we can see two 24-hour jobs that had a 17-hour overlap. Setting preemptible=0 in the WDL does not seem to be effective at avoiding doppelgangers.
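
    For reference, a quick way to pull those date stamps (a sketch reusing the oids.txt file and the describe command from the previous comment):

    cat oids.txt | xargs -n 1 -tI {} sh -c "gcloud alpha genomics operations describe {} | grep -E 'createTime|endTime'"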

  • esalinas Broad Member, Broadie ✭✭✭

    @gordon123

    Inspection of the network tab is also consistent with non-preemptible.

        "FilterWorkflow.blat": [{
          "preemptible": false,
    ...
          "returnCode": 0,
          "jobId": "operations/EMS24NaqKxjcwfPP0-vIjdIBIP3g3tG1AioPcHJvZHVjdGlvblF1ZXVl",
    
    

    Additionally, if I inspect the JSON returned in the monitor tab and look at "workflow" under "submittedFiles" to see the WDL that was run, I see the task was not set to be preemptible there either:

    task blat {
      File bam
      File bai
      File maf
      String id
      Int diskGB

      command {
        python /opt/realign.py ${bam} ${maf} ${id}
      }

      runtime {
        memory: "7 GB"
        disks: "local-disk ${diskGB} HDD"
        docker: "gcr.io/broad-firecloud-itools/blat_filter_v2"
      }

      output {
        File blat_results = "${id}.blat.maf"
        File debug_results = "${id}.blat.rejected.maf"
        File blat_all_maf = "${id}.blat.all.maf"
      }
    }
    
    
    
    
  • Chip 415M 4053 Member, Broadie

    Workspace broad-firecloud-pcawg/damage, workflow damage_table_plots_workflow, sample_set LUAD-TP (515 samples) was started 24 days ago (February 25, 2017, 3:38 AM). All jobs finished (or failed) within three days (by February 27, 2017, 9:27 PM), including sample LUAD-05-4250-TP (workflow ID a8b8b5cb-7445-410c-adbd-e4e68e77f925), which finished and was delocalized on February 25, 2017, 1:29 PM. This job is still listed as "Running" on the monitor tab and in the breakdown of jobs (Queued (0), Launching (0), Submitted (0), Running (1), Aborting (0), Succeeded (505), Failed (9), Aborted (0)).

    Questions:
    a) Should I abort the job, or is there some benefit (e.g. debugging) in leaving it in the running state?
    b) Is there any Google charge accrued for running-state jobs that finished 23 days ago?

  • abaumann Broad DSDE Member, Broadie ✭✭✭

    We can check whether this has any running instances (it should not, especially after all this time) by looking at the operation IDs for each call. If the state is just inconsistent, then it doesn't cost anything. Aborting when the state is mismatched may not abort properly and could just get stuck in the aborting state.
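
    A minimal sketch of that check (operation-ids.txt is an illustrative file name holding one operations/... ID per call, copied from the UI):

    # flag any operation that Google does not report as done
    while read -r oid; do
      if ! gcloud alpha genomics operations describe "$oid" | grep -q '^done: true'; then
        echo "not done according to Google: $oid"
      fi
    done < operation-ids.txt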
