Example from video (AWS Batch + GATK WDL/Cromwell) does not work for me

dtenenba Member

Hi, there's a video from AWS + The Broad which has two demos in it. The first is about setting up a Cromwell environment in AWS Batch and running a Hello World job. The second demo is about running a GATK workflow using these files and the aws.conf file from the first demo.

This demo fails for me. I expected it to work because it appears all of the S3 buckets used are publicly readable.

Here's the command I run and the error I get:

java -Dconfig.file=aws.conf -jar cromwell-36.jar run HaplotypeCallerWF.aws.wdl -i HaplotypeCallerWF.aws.inputs.json
...
[2018-11-14 14:59:39,78] [error] WorkflowManagerActor Workflow 64cfdecd-0b9f-4ff5-a9a3-3a089390fb26 failed (during ExecutingWorkflowState): java.lang.RuntimeException: Failed to evaluate 'HaplotypeCallerGvcf_GATK4.scattered_calling_intervals' (reason 1 of 1): Evaluating read_lines(scattered_calling_intervals_list) failed: [Attempted 1 time(s)] - IOException: Could not read from s3://aws-gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt: Cannot access file: s3://s3.amazonaws.com/aws-gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt

If you look carefully at the error, it references the following URL:

s3://s3.amazonaws.com/aws-gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt

That, of course, is not a valid URL, though it would be if you replaced s3: with https: or dropped the s3.amazonaws.com/ part. So Cromwell (or GATK?) is mangling the URL in a way that breaks the demo. Any ideas?
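
To spell it out, either of the first two forms below points at the same file, but the hybrid form in the error does not:

valid S3 URI:         s3://aws-gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt
valid HTTPS URL:      https://s3.amazonaws.com/aws-gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt
what the error shows: s3://s3.amazonaws.com/aws-gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt
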
For reference, the rest of the stack trace is below.

Thanks,

    at cromwell.engine.workflow.lifecycle.execution.keys.ExpressionKey.processRunnable(ExpressionKey.scala:29)
    at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.$anonfun$startRunnableNodes$7(WorkflowExecutionActor.scala:510)
    at cats.instances.ListInstances$$anon$1.$anonfun$traverse$2(list.scala:73)
    at cats.instances.ListInstances$$anon$1.loop$2(list.scala:63)
    at cats.instances.ListInstances$$anon$1.$anonfun$foldRight$1(list.scala:63)
    at cats.Eval$.loop$1(Eval.scala:341)
    at cats.Eval$.cats$Eval$$evaluate(Eval.scala:372)
    at cats.Eval$Defer.value(Eval.scala:258)
    at cats.instances.ListInstances$$anon$1.traverse(list.scala:72)
    at cats.instances.ListInstances$$anon$1.traverse(list.scala:12)
    at cats.Traverse$Ops.traverse(Traverse.scala:19)
    at cats.Traverse$Ops.traverse$(Traverse.scala:19)
    at cats.Traverse$ToTraverseOps$$anon$3.traverse(Traverse.scala:19)
    at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.cromwell$engine$workflow$lifecycle$execution$WorkflowExecutionActor$$startRunnableNodes(WorkflowExecutionActor.scala:504)
    at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:186)
    at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor$$anonfun$5.applyOrElse(WorkflowExecutionActor.scala:184)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
    at akka.actor.FSM.processEvent(FSM.scala:687)
    at akka.actor.FSM.processEvent$(FSM.scala:681)
    at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.akka$actor$LoggingFSM$$super$processEvent(WorkflowExecutionActor.scala:49)
    at akka.actor.LoggingFSM.processEvent(FSM.scala:820)
    at akka.actor.LoggingFSM.processEvent$(FSM.scala:802)
    at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.processEvent(WorkflowExecutionActor.scala:49)
    at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:678)
    at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:672)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.akka$actor$Timers$$super$aroundReceive(WorkflowExecutionActor.scala:49)
    at akka.actor.Timers.aroundReceive(Timers.scala:51)
    at akka.actor.Timers.aroundReceive$(Timers.scala:40)
    at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.aroundReceive(WorkflowExecutionActor.scala:49)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
    at akka.actor.ActorCell.invoke(ActorCell.scala:557)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Answers

  • dtenenba Member
    edited November 14

    (post edited to make comment unnecessary)

  • dtenenba Member

    I figured out a way to get a little further with this. I downloaded the scattered calling intervals list:

    aws s3 cp s3://aws-gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt .
    

    Then in HaplotypeCallerWF.aws.inputs.json I changed HaplotypeCallerGvcf_GATK4.scattered_calling_intervals_list from s3://aws-gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt to hg38_wgs_scattered_calling_intervals.txt.
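
    Concretely, the relevant entry in the inputs JSON went from roughly

    "HaplotypeCallerGvcf_GATK4.scattered_calling_intervals_list": "s3://aws-gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt"

    to

    "HaplotypeCallerGvcf_GATK4.scattered_calling_intervals_list": "hg38_wgs_scattered_calling_intervals.txt"

    with everything else in the file left unchanged.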

    I'm assuming the problem here is that this part of the workflow runs on my local machine, which doesn't understand the S3 filesystem. So I downloaded the file to my local filesystem and referred to it by a local filename instead of an S3 URL.

    This allowed me to proceed past the point where it failed before, and it did kick off a number of AWS Batch jobs. Note, though, that I got a Too Many Requests error, so perhaps there needs to be some delay or exponential backoff when creating a lot of jobs. (Another suggestion might be to use Array Jobs, but I realize that could be a lot of work to support.)

    Unfortunately the jobs failed. The error seemed to be the same as in my other post, namely that the input file from S3 cannot be found in /cromwell_root. So far I have not been able to use S3 for input or output data in Cromwell when running with the AWS Batch back end, which makes it impossible to do real work.

    One other note, when I went to look at the jobs that failed in the AWS Batch console, I clicked on the link to view the logs and got a message that the log stream was not found. I was able to find the logs by navigating to Log Groups / /aws/batch/job and then clicking on one of the HaplotypeCaller... streams.
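
    In case it helps, the same streams can also be listed from the CLI (assuming the default AWS Batch log group), for example:

    aws logs describe-log-streams --log-group-name /aws/batch/job
    aws logs get-log-events --log-group-name /aws/batch/job --log-stream-name <one of the HaplotypeCaller... streams>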

    Would love any guidance that could be provided regarding this. Thanks in advance.

  • dtenenba Member

    OK, I have one more clue. Now I see in the logs of one of the failed jobs the following:

    download failed: s3://aws-broad-references/hg38/v0/Homo_sapiens_assembly38.fasta to ../cromwell_root/aws-broad-references/hg38/v0/Homo_sapiens_assembly38.fasta [Errno 28] No space left on device
    

    So further along, when this file is referenced, it fails again. (Do you want the equivalent of bash's set -e here, so that the entire job fails as soon as the initial download fails?)
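
    A sketch of what I mean (this is not what Cromwell actually generates, just the behavior I'd expect):

    #!/bin/bash
    set -euo pipefail   # stop at the first failing command instead of carrying on

    # localization step: if this fails (e.g. "No space left on device"),
    # the job fails right here rather than later when the file is used
    aws s3 cp s3://aws-broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
        /cromwell_root/aws-broad-references/hg38/v0/Homo_sapiens_assembly38.fasta

    # ...the actual task command would only run if the download succeeded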

    I did some poking around to see how much scratch space there actually is.
    I modified the hello world job to instead run sleep infinity so I could inspect the running container. I used ECS and EC2 to find the public IP of the instance, ssh'd to it, and ran docker exec to get into the running container. Looking at the size of /cromwell_root:

    # df -h
    Filesystem                                                                                        Size  Used Avail Use% Mounted on
    ...
    /dev/xvda1                                                                                        7.8G  995M  6.7G  13% /cromwell_root
    ...
    

    It seems that the size of the file it failed to download (s3://aws-broad-references/hg38/v0/Homo_sapiens_assembly38.fasta) is just over 3 GB, and since several containers are running on the same instance, that would quickly fill up the ~6.7G available in /cromwell_root.

    Is there something I can do to increase scratch space on /cromwell_root?
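
    (One thing I wondered about, though this is purely a guess on my part: if the compute environment accepts an EC2 launch template, could a bigger root volume be requested along these lines? The template name here is just a placeholder.)

    aws ec2 create-launch-template \
        --launch-template-name cromwell-bigger-scratch \
        --launch-template-data '{"BlockDeviceMappings":[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":100,"VolumeType":"gp2","DeleteOnTermination":true}}]}'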

    Also, regarding the Too Many Requests errors: I noticed this bit of configuration that can be put in the config file to govern the request rate. I changed number-of-requests to 10000 and per to 1 second, in line with AWS's stated rate limits, but I still got Too Many Requests errors. Are these settings honored on AWS, or just on GCP? Is there a file like this one that includes all the possible config for AWS?
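
    For reference, the block I edited looked roughly like this (I believe these are the io throttling settings from the example config, though I may have the nesting slightly wrong):

    system {
      io {
        number-of-requests = 10000
        per = 1 second
      }
    }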

    And finally, why did I have to download the scattered calling intervals file? What prevented Cromwell from seeing that file in S3? I am using the same config file that Cromwell uses in its tests, with the following changes (sketched below):

    • include of build_application.inc.conf commented out
    • root changed to my s3 bucket
    • queueArn changed to my queue ARN
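
    In other words, the edits amount to roughly this (bucket name and ARN are placeholders; everything else is left as it is in the repo's aws.conf):

    # include "build_application.inc.conf"    # commented out

    root = "s3://<my-bucket>"                                                 # under the AWS Batch provider's config block
    queueArn = "arn:aws:batch:us-east-1:123456789012:job-queue/<my-queue>"    # under default-runtime-attributes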

    I'm a bit puzzled: this workflow run on AWS Batch seems to be part of Cromwell's test suite, so I assume it works in your CI process, but it isn't working for me. What can I change to make it work?

    Thanks in advance.

  • Ruchi Member, Broadie, Moderator, Dev admin

    Hey @dtenenba,

    Disk space
    The /cromwell_root directory is meant to be an auto-expanding disk, so if you're seeing jobs fail with out-of-space errors, that's unexpected.

    Too many requests
    That configuration isn't wired in for the AWS backend; it only works for the Google backend. There is a recommended workaround on a GitHub issue:

    system {
      job-rate-control {
        jobs = 1
        per = 1 second
      }
    }
    

    Let me know if I missed anything.

    Thanks
