
Task in "Running" state a day after stderr says it failed

aryee Member, Broadie

I have a workflow (ID 48e54c59-9d56-4dd7-8745-a5cca7ccaa30) where a task failed a day ago according to the stderr:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/opt/conda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/conda/lib/python2.7/site-packages/multiprocess/pool.py", line 389, in _handle_results
    task = get()
  File "/opt/conda/lib/python2.7/site-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/opt/conda/lib/python2.7/site-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/opt/conda/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/opt/conda/lib/python2.7/pickle.py", line 982, in load_binstring
    self.append(self.read(len))
MemoryError

It looks like it ran out of RAM. The task is still listed in the "Running" state. Can I assume that this VM was terminated or is there a way to check?
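
For reference, a minimal sketch of one way to check the VM's status with the Compute Engine API (google-api-python-client); the project ID, zone, and instance name below are hypothetical placeholders that would come from the workflow metadata or the Cloud Console:

    # Sketch: query the Compute Engine API for the VM's status.
    # Requires google-api-python-client and application-default credentials.
    from googleapiclient import discovery
    from googleapiclient.errors import HttpError

    compute = discovery.build('compute', 'v1')

    try:
        instance = compute.instances().get(
            project='my-billing-project',   # placeholder
            zone='us-central1-a',           # placeholder
            instance='ggp-instance-name',   # placeholder VM name
        ).execute()
        print('Instance status:', instance['status'])  # e.g. RUNNING, TERMINATED
    except HttpError as err:
        if err.resp.status == 404:
            print('Instance no longer exists; it was already deleted.')
        else:
            raise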

Thanks.

Answers

  • aryee Member, Broadie

    Related to this, I found this VM active in the Cloud Console, even though my workspace reports no running workflows:

     ggp-18245640090132164171
    

    I assume it's running up costs but I don't have permission to stop it. Any advice on how to kill it?
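
    For what it's worth, stopping it yourself would amount to something like the sketch below (project and zone are hypothetical placeholders); without the compute.instances.stop permission on the project, the call will come back with a 403:

      # Sketch: ask Compute Engine to stop the orphaned VM.
      # Fails with HTTP 403 if the account lacks permission on the project.
      from googleapiclient import discovery
      from googleapiclient.errors import HttpError

      compute = discovery.build('compute', 'v1')

      try:
          op = compute.instances().stop(
              project='my-billing-project',   # placeholder
              zone='us-central1-a',           # placeholder
              instance='ggp-18245640090132164171',
          ).execute()
          print('Stop requested, operation:', op['name'])
      except HttpError as err:
          print('Could not stop the instance:', err.resp.status)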

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Typically, when a task is still listed as "running" but has failed, the VM has been terminated. I will ask a developer to take a look.

    What is the name of the Workspace? Please also share the workspace with [email protected] if you haven't already.

  • EADG Kiel Member ✭✭✭

    Hi @KateN,

    I have a similar problem: I played around with my free credits and have now ended up with an unstoppable compute instance. Since it is marked as preemptible, it will "hopefully" only run for 24 hours, using up my free credits in the meantime. The associated workflow is marked as failed and also as stopped.

    What can I do?

    Greets EADG

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Could you please share the workspace with [email protected]? We will also need to know the workspace name, and the workflow ID for the specific run you're seeing this behavior on. I would like to have a developer take a look.

  • EADG Kiel Member ✭✭✭

    Hi @KateN,

    I sent you a PM with the information.

    Greets EADG

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Thank you for sharing; I will have a developer take a look at it straight away.

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Unfortunately there isn't a way for us to see how much compute was used for a job like that, which is what I was hoping our developers could check on for you. The good news, however, is that your submission, in spite of being labelled as "running", is not permanently running. It did stop; the issue is just with updating the UI to properly display the status. This is a known bug which will be patched when we update to Cromwell version 30 later this week.

  • EADG Kiel Member ✭✭✭

    Hi @KateN,
    next time I can take a picture :).

    Hm, but would it be possible to grant users the right to kill their own Compute Engine instance when it was started by a workflow in FireCloud?

    Greets,

    EADG

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    If you'd like to kill a running job, you can hit the Abort button on the monitor page for the specific job you'd like to abort. When you hit Abort, FireCloud sends your request to Google's cloud machines, asking them to be shut down. In a majority of cases, this will almost immediately shut the machine down, but sometimes the machine will hang. Google controls how they are shut down, so we unfortunately cannot influence them more than just asking for the workflow to stop.
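
    For anyone who prefers to script it, the same abort can in principle be issued against the FireCloud API. The sketch below is only an assumption based on the orchestration API's submissions endpoint; the base URL, namespace, workspace name, submission ID, and token are all placeholders:

      # Sketch (assumed endpoint): abort a submission through the FireCloud
      # orchestration API instead of the Abort button. All identifiers are
      # placeholders; the token comes from your Google credentials
      # (e.g. gcloud auth print-access-token).
      import requests

      base = 'https://api.firecloud.org/api'          # assumed base URL
      namespace = 'my-billing-project'                # placeholder
      workspace = 'my-workspace'                      # placeholder
      submission_id = '00000000-0000-0000-0000-000000000000'  # placeholder
      token = '<oauth2-access-token>'                 # placeholder

      resp = requests.delete(
          '{}/workspaces/{}/{}/submissions/{}'.format(
              base, namespace, workspace, submission_id),
          headers={'Authorization': 'Bearer ' + token},
      )
      print(resp.status_code)  # expect a 2xx when the abort request is accepted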
