Unclear error: PAPI error code 10. Message: 14: VM ggp-14729985031179362160 stopped unexpectedly.

When I was running one of my workflows in aryee-merkin/dna-methylation-pipeline-paper, I got this error message: Task call_bismark_pool.merge_replicates:NA:1 failed. The job was stopped before the command finished. PAPI error code 10. Message: 14: VM ggp-14729985031179362160 stopped unexpectedly.
I was running a bunch of samples. Most of them ran successfully, but about 20% of them failed. These were not running on preemptible machines because I set preemptible to 0. I have shared the workspace with [email protected]. For this specific run the workflow ID was b788bf4a-c957-4080-b0ac-a6d4ecde94c7.
I don't think there is anything specifically wrong with these samples, because I have run them before and they all worked fine. Can you let me know if it is an issue on my end?
Best Answers
-
Tiffany_at_Broad Cambridge, MA admin
Hi @DivyKangeyan - there was a bug with Cromwell. I am waiting to hear back more details from the team.
-
Ruchi admin
Hey @DivyKangeyan,
Your second submission failed for two reasons -- one being an intermittent cloud failure (Error code 10: Message 14) and the other being a bug around updating status of completed jobs.
It turned out the bug was the reason causing consistent failure around your submissions, and it's been addressed. You're totally correct that the message around the transient cloud failure should be clearer-- and I've filed an issue to address this. Thanks!
-
SChaluvadi admin
@DivyKangeyan @breardon Got word that this upgrade to PAPI v2 is targeted to go out in the beginning to middle of next quarter - it is not something user-specifiable. maxRetries is a great option to help circumvent the issues that you have been experiencing. @mleventhal - Thanks for the great suggestion! Here is a link that explains in more detail what you can achieve by adding maxRetries.
Answers
Hi @DivyKangeyan, we will take a look. Would you mind also sharing with [email protected]?
Sure, I just shared it.
I ran the same participant set again with this submission ID: 3922dfac-0696-41a1-9ee0-d94b45a84a42. This time less than 10% of the samples failed. All the samples that failed first ran on a preemptible machine and failed, then ran on a non-preemptible machine and failed permanently.

Hi @DivyKangeyan - we are investigating now. This one is a bit tricky because the 38 failed workflows have different failure messages. As we investigate, one thing you could try is creating a smaller participant_set with the participants that failed. It seems like the tasks that are failing here are call_bismark_pool.align_replicates and merge_replicates, and what you could do is add a Cromwell feature called maxRetries in your runtime attributes for these tasks. For example:
runtime {
  maxRetries: 3
}
If they are failing from "cloud issues", this parameter will retry the task for you. Using this feature and running on a smaller participant_set may help. We will report back as we learn more.
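For context, a hypothetical WDL task using this attribute might look like the following. This is a sketch only: the task name echoes the pipeline's merge step, but the command, container image, and resource values are illustrative assumptions, not taken from the actual dna-methylation-pipeline-paper workflow.

```wdl
task merge_replicates {
  input {
    Array[File] bams
  }

  command <<<
    # Illustrative merge step; the real task's command may differ
    samtools merge merged.bam ~{sep=' ' bams}
  >>>

  runtime {
    docker: "quay.io/biocontainers/samtools:1.9--h8571acd_11"  # illustrative image
    memory: "8 GB"                                             # illustrative value
    # Cromwell will rerun the task up to 3 more times if it fails,
    # which papers over transient errors like "PAPI error code 10"
    maxRetries: 3
  }

  output {
    File merged_bam = "merged.bam"
  }
}
```

Note that maxRetries retries on any failure, not just cloud errors, so a genuinely broken task will still be rerun the full number of times before the workflow reports failure.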
Hi @Tiffany_at_Broad, I checked again and it seems that, except for two jobs, all of the 38 jobs that failed had the same
The job was stopped before the command finished. PAPI error code 10. Message: 14: VM ggp-14729985031179362160 stopped unexpectedly.
message. I will try what you suggested to see if I can get around the problem.

Hi @Tiffany_at_Broad, I reran the failed samples. This is the submission ID: d3c5d92c-25a2-4dec-a544-c65df1b67b44. They all succeeded without any issues.
Hi @Tiffany_at_Broad, any update on this issue? I reran the same participant set (submission ID: b2e7aac5-53b0-4e6e-a46a-df2aa35d0522). They all ran without any issue, so it seems the samples were not the problem. If this was an intermittent cloud issue, I think the error message should be clearer.

Hi @DivyKangeyan - there was a bug with Cromwell. I am waiting to hear back more details from the team.
Hey @DivyKangeyan,
Your second submission failed for two reasons -- one being an intermittent cloud failure (Error code 10: Message 14) and the other being a bug around updating status of completed jobs.
It turned out the bug was the reason causing consistent failure around your submissions, and it's been addressed. You're totally correct that the message around the transient cloud failure should be clearer-- and I've filed an issue to address this. Thanks!
Hello,
I am also running into this transient error. If I run the samples that failed with this error individually, they succeed, but it turns up when I run sample sets of about 1000. Reporting to let the public know that I and others have been running into this issue.
Best,
Matt
Hi @mleventhal, thank you for reporting. As a workaround, are you running on smaller sample sets?
I do find success with smaller sample sets (e.g., 300). In my sample sets that are 1000 samples in size, I get this error from 3-8 samples (<1%) at various stages of the workflow. It would be good to find a way to avoid this error, as the sets of 1000 are already batched from a larger set of about 8000.
Hi, I am still seeing this issue with FireCloud jobs. With small sample sets (10 or 25), workflows run smoothly, but if the sample set contains even 100 samples it fails with PAPI error code 10. Message: 14. I think if this error persists it can disrupt some of the analyses that I do. I prefer the cloud over HPC mainly because it can handle a large number of samples without wait time or other issues. What I am experiencing with PAPI error code 10 seems to be very relevant to large sample sets.

@DivyKangeyan - I am having the team take another look at this and will get back to you soon.
@DivyKangeyan Apologies for the delay - I checked again with the team, and it looks like this error is still the result of a PAPI v1 bug that is causing failures. Even though you may set preemptible to 0, PAPI v1 has shown that it can still fail with a preemption error like the one you are seeing. However, PAPI v2 should fix this issue - testing of it has not shown this behavior to occur.
Several members of our group are observing this issue too, even without preemptible being turned on. Is there an ETA for when this will be resolved?
@SChaluvadi I do see these errors on non-preemptible machines. My samples run fine on both preemptible and non-preemptible machines in a small sample set. How do I specify PAPI v2?
I have circumvented this issue by adding the following to the runtime block:
maxRetries: 1
This works for tasks running on either preemptible or non-preemptible machines, and can allow all tasks in large sample sets to succeed.
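As a sketch of this workaround, the attribute can sit alongside the other runtime settings a task already declares. The docker image and the rest of the attributes here are illustrative assumptions, not taken from any of the workflows in this thread:

```wdl
runtime {
  docker: "ubuntu:18.04"  # illustrative image
  preemptible: 0          # non-preemptible VM, as in the original runs
  # Even with preemptible set to 0, the PAPI v1 bug discussed above can
  # surface as "PAPI error code 10"; one retry is usually enough to
  # absorb a transient failure:
  maxRetries: 1
}
```

The trade-off is the same as with maxRetries: 3 above: a task that fails for a real reason will be rerun once before the workflow reports failure, roughly doubling the cost of that task in the worst case.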
@DivyKangeyan @breardon Got word that this upgrade to PAPI v2 is targeted to go out in the beginning to middle of next quarter - it is not something user-specifiable. maxRetries is a great option to help circumvent the issues that you have been experiencing. @mleventhal - Thanks for the great suggestion! Here is a link that explains in more detail what you can achieve by adding maxRetries.
No problem, credit goes to @bshifaw for pointing me to these threads when I had this issue before:
https://gatkforums.broadinstitute.org/firecloud/discussion/comment/51744#Comment_51744
https://gatkforums.broadinstitute.org/firecloud/discussion/comment/50231#Comment_50231