What is the limit on the number of jobs that can be launched while analyzing a large data set?

mleventhal (Cambridge, MA; Member, Broadie)

Hello,

I am analyzing a data set with 78,000 samples, where my method runs five serial tasks on a MAF (i.e. no scattering). Right now I am running my analysis in batches, not launching the next batch until the previous one has completed. While the analyses are running successfully, I am curious how large I can make my batches in this scenario. I recall that a few months ago the reported limit was 60,000 jobs, but I was not sure whether that meant 60,000 jobs launched all at once, or whether adding more than 60,000 jobs to the queue would exceed FireCloud's limit. I am under the impression that the number of jobs actually running at once is limited by my VM quotas, which are well below 60,000. That would mean that even with many jobs in the queue, the number of tasks running at any one time would not push FireCloud's limit. Am I correct in this assumption, or should I ensure that the number of workflows launched does not push the total over 60,000? What would be your recommended batch size in this case? Thank you!
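For concreteness, the workflow shape is roughly the sketch below, in draft-2 WDL. The task and file names here are placeholders of mine, since the method's real steps aren't named above:

    task step {
      File infile
      command {
        # placeholder for one of the five analysis steps
        cat ${infile} > out.txt
      }
      output { File out = "out.txt" }
      runtime { docker: "ubuntu:18.04" }
    }

    workflow maf_analysis {
      File maf
      # five serial calls; each consumes the previous step's output,
      # so nothing is scattered
      call step as step1 { input: infile = maf }
      call step as step2 { input: infile = step1.out }
      call step as step3 { input: infile = step2.out }
      call step as step4 { input: infile = step3.out }
      call step as step5 { input: infile = step4.out }
    }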

Best,
Matt

Answers

  • KateN (Cambridge, MA; Member, Broadie; Moderator)

    Thanks for your question. I'm not completely sure about that, so I will have to get back to you.

  • mleventhal (Cambridge, MA; Member, Broadie)

    Hi Kate,

    At your suggestion I tried running a batch of 4,000. While most of the runs completed successfully and the cloud remained functional, 99 of my jobs failed with the following transient error: "PAPI error code 10. 14: VM ggp-15845444537195379259 stopped unexpectedly."

    When running in batches of 1,000, I would see this error at most twice, but it occurred much more frequently in this larger sample set. If I rerun the failed tasks individually, they succeed, but I would like to know whether there is a way to avoid this issue. Thank you!

  • bshifaw (Member, Broadie; Moderator)

    @mleventhal
    In short, this happens when the Google Pipelines API fails for one reason or another and produces the very general error message "PAPI error code 10. 14". I don't believe there is much to be done, since this is an issue on the cloud end, though it should become less likely once FireCloud shifts to PAPI v2.

    The following post has some details about the "PAPI error code 10. 14" message:
    preempted job w/ no caching set to failed state
    There is also an issue ticket in Cromwell to create a clearer message for this error.

  • mleventhal (Cambridge, MA; Member, Broadie)

    While this confirms my suspicion that the error is a transient cloud failure, my question is why the cloud has these sporadic failures and why their frequency appears to be proportional to the number of jobs launched from a sample set. Will this issue be resolved with PAPI v2, and if so, when should we expect its implementation?

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie; Moderator)

    @mleventhal, I will mark Beri's answer as "Accepted." Please let us know if you have any doubts or further questions.

  • mleventhal (Cambridge, MA; Member, Broadie)

    @Tiffany_at_Broad While I understand that the failures fall under "the job failed unexpectedly," that the solution is simply to rerun the failed tasks on their own, and that this should happen less frequently once FireCloud upgrades to PAPI v2, I still do not understand why this transient failure increases so dramatically with sample set size. Here are the numbers I have been observing:

    <1,000 jobs: no VM shutdowns, no jobs failed due to transient errors

    1,000 jobs: 0-3 jobs failed due to the transient error

    4,000 jobs: 90-100 jobs failed due to the transient error

    The jump from at most 3 failures to roughly 100 is startling: going from 1,000 to 4,000 jobs is a 4x increase, so if each job failed independently at a fixed rate I would expect about 4x the failures (roughly 0.3% per job either way), not a jump to about 2.5% per job. While I understand what causes the error, I do not understand why it occurs with much greater per-job frequency in larger sample sets. Is there something I am misunderstanding? Thank you!

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie; Moderator)

    Hi @mleventhal -
    I spoke with our contact at Google and he referred me to this page:
    "For reference, we've observed from historical data that the average preemption rate varies between 5% and 15% per seven days per project, occasionally spiking higher depending on time and zone. Keep in mind that this is an observation only: preemptible instances have no guarantees or SLAs for preemption rates or preemption distributions"

    In conclusion, it might be that the higher job/VM count, combined with the zone and time of day, pushed this particular project's preemption rate closer to 15% than to 5%.

    Does that help?

  • mleventhal (Cambridge, MA; Member, Broadie)

    Hi @Tiffany_at_Broad,

    The jobs that typically fail are those where I do not specify any preemption. Given the information you shared, would a variable preemption rate affect a job where I do not populate the "preemptible" parameter in the runtime block of the WDL?
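
    To make that concrete, my runtime blocks for these tasks look roughly like the sketch below (illustrative only; to my understanding, omitting the "preemptible" key makes Cromwell fall back to the backend default, which is a standard, non-preemptible VM unless workflow options override it):

        task non_preemptible_step {
          command { echo "work" }
          runtime {
            docker: "ubuntu:18.04"
            memory: "4 GB"
            # no "preemptible" key: Cromwell uses the backend default
            # (0, i.e. a standard VM, unless workflow options say otherwise),
            # so preemption rates should not apply to this task
          }
        }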

  • Tiffany_at_Broad (Cambridge, MA; Member, Administrator, Broadie; Moderator)

    Good point. I wouldn't think so. I have followed up with my Google contact and will let you know more once I hear back.

  • bshifaw (Member, Broadie; Moderator)

    "The jobs that typically fail are those where I do not specify any preemption. Given the information you shared, would a variable preemption rate affect a job where I do not populate the 'preemptible' parameter in the runtime block of the WDL?"

    Hey @mleventhal,

    Please review the forum threads posted earlier; many of your questions about this error message have already been answered there:
    https://gatkforums.broadinstitute.org/firecloud/discussion/comment/51744#Comment_51744
    https://gatkforums.broadinstitute.org/firecloud/discussion/comment/50231#Comment_50231

  • mleventhal (Cambridge, MA; Member, Broadie)

    I see. I will report back if increasing "maxRetries" circumvents the issue.

  • mleventhal (Cambridge, MA; Member, Broadie)
    Accepted Answer

    Adding "maxRetries" to the tasks with the most common transient errors cut the number of transient failures in about half in a set with 4000 samples. The remaining failures were transient errors in tasks with and without preemptibles that lacked the "maxRetries" argument. It appears that if there is any need to run in large (1000+) batches any task in the WDL must have a "maxRetries" argument. Thank you for your help,

  • bshifaw (Member, Broadie; Moderator)

    Thanks for sharing!
