
task marked as Failed but logs and bucket contents have no hint of why?

esalinas, Broad (Member, Broadie)

I have a WDL task that ran in FireCloud and lists 3 files in its output block to be delocalized.
The task is one of 17 tasks in a scatter group. The one I refer to here is #9.

The task is marked as "Failed", and I cannot find a reason why.

The stdout file in the bucket is empty and gives no clue about the "Failed" status. It is expected to be empty: another scatter job finished fine and also had an empty stdout file.

The stderr file has no hint of error.

The JES log has no hint of error either. In addition, inspection of the events via "gcloud alpha genomics operations describe" shows a succession of events and the 3 files in the output block are successfully copied back to the bucket.

In addition, the -rc.txt file is in the bucket and contains a zero.

What are the reasons a task would be marked as "Failed"?

Could it be that a different task "Failed", but this task is marked as "Failed" instead? In the UI, all the other tasks are marked as "Running".

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    I'm not sure but it could be that there was a timeout in the last stages of communication about the task status. I think that seems more likely than switching status between tasks.

    The team is working on reliability-related improvements that should alleviate this sort of issue. I'm not sure we can do anything else on this one.

  • esalinas, Broad (Member, Broadie)

    @Geraldine_VdAuwera

    Coming back from the weekend, I thought about this again. I noted that the same WDL ran successfully on a small BAM but not on a big BAM, which was a clue. I first suspected an out-of-disk error during the task, but the absence of such a message in the logs, and the fact that all the files expected to be delocalized were delocalized, seemed inconsistent with that. Then I remembered that I was using read_int to get the size of the output (a BAM). So I think the issue is in the task, but after the VM runs. I went to FC and used the Google Chrome "developer tools" network tab. There, I found this message:

      "failures": [{
        "message": "Failed to evaluate outputs.: Could not evaluate ClipReads.bam_size = read_int(\"bam_size.dat\")\n\tFor input string: \"5972278044\""
      }],
    

    Is using the Google Chrome developer tools network tab the recommended way to find errors like this? Could there be guidance on what kinds of values read_int can read, such as the max/min sizes of ints?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi @esalinas, nice job finding the error message but you definitely shouldn't have to do that; this is a bug that affects display of errors, and we have a ticket to fix it here.

    The constraints on read_int should ultimately be found in the WDL specification docs. Those docs are currently being overhauled, so I'm not sure what the status of this particular item is, but for something like that you can ask in the WDL forum.
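
On the read_int question: the value in the error message, 5972278044, is larger than 2147483647, the maximum of a signed 32-bit integer. That would explain the failure if the engine parses read_int results as 32-bit integers. The 32-bit width is an assumption here, not something confirmed in this thread. A minimal Python sketch of the range check (the helper name fits_int32 is hypothetical):

```python
# Assumption: read_int results are parsed as signed 32-bit integers.
INT32_MIN = -2**31      # -2147483648
INT32_MAX = 2**31 - 1   #  2147483647


def fits_int32(text: str) -> bool:
    """Return True if the decimal string is within the signed 32-bit range."""
    value = int(text.strip())
    return INT32_MIN <= value <= INT32_MAX


# Value taken from the error message quoted in this thread (BAM size in bytes).
bam_size = "5972278044"
print(fits_int32(bam_size))  # False: the value exceeds INT32_MAX
```

If sizes above that limit are expected, reading the value with read_float or read_string instead of read_int may avoid the problem, though that is untested here.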
