Variable machine disk size

dlivitz Member, Broadie

Hi Firecloud Team,

I am very confused about how to set a variable disk size in FireCloud WDLs.

My use case:
workflow inputs - (array of files)
required disk size - (total size of all inputs)*4

I found a similar question, but it was unanswered:
http://gatkforums.broadinstitute.org/firecloud/discussion/9347/size-arithmetic-issue

Is this currently possible in firecloud, and if not, what is the recommended alternative if there is one?

As a workaround I am currently writing a python script that makes Google API calls for each of the inputs to query its size and writes an annotation with the sum, although this seems clunky.

Issue · GitHub #2225 by Geraldine_VdAuwera (state: closed; closed by vdauwera)

Best Answer

  • Geraldine_VdAuwera Cambridge, MA admin
    Accepted Answer

    @dlivitz I come bearing good news! What you want is feasible, with a hack. And bad news! The hack is pretty gross.

    First, yes, size() can indeed be used (and importantly, evaluated before localization) in the runtime block. So you can use that in principle to adjust disk and memory requests (nice!). However size() returns a float which is not valid for any property you want to set that requires an integer (sad!).

    Obviously if you can convert the float to int, then it will work fine. The problem is we don't currently have a nice civilized floatToInt() function. Enter the hack (and ugliness and shame) as originally reported here:

      Float f = size(infile)                              # size() returns a Float
      String s = f                                        # coerce the Float to a String
      String string_before_decimal = sub(s, "\\..*", "")  # strip the decimal point and everything after it
      Int final_int = string_before_decimal               # coerce the String to an Int
    
      runtime {
        memory: final_int
      }
    

    There's some discussion going on right now (like, as I type) among the Cromwell team about what is the best way to address this to get rid of the hack (coercion, explicit function call, rounding options...?). Once they've decided on a shortlist of technical options it'll go to our WDL focus group for testing on real people. So there's hope for a clean, pretty future.

    Meanwhile please try out the hacky approach and let us know how it works out for you.
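
    To connect this back to the original disk-sizing question, the same hack can drive the disks attribute; here is a minimal, untested sketch (the task name, file name, and the 4x multiplier from the question are illustrative):

      task sized_task {
        File infile
        Float f = size(infile, "G") * 4              # input size in GB, times a safety factor
        String s = f                                 # coerce Float -> String
        String before_decimal = sub(s, "\\..*", "")  # strip the fractional part
        Int disk_gb = before_decimal                 # coerce String -> Int

        command {
          cp ${infile} output.copy
        }

        runtime {
          disks: "local-disk ${disk_gb} HDD"
          docker: "ubuntu:14.04"
        }
      }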

Answers

  • esalinas (Broad) Member, Broadie ✭✭✭

    @dlivitz this makes me think of an "autoSize" function: a built-in that, in the context of a runtime block, "automagically" knows the size of all the files being localized. Then the WDL'er could say something like
    diskGB=autoSize()+1 (to add 1 GB of disk)
    or maybe
    diskGB=autoSize()+autoSize()*0.1
    where there's a buffer of extra space that's 10%.

    That's an idea I've had, anyway....
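
    As a sketch, the hypothetical built-in (autoSize() does not exist in Cromwell; the name and behavior here are purely illustrative) might be used like this:

      task example {
        Array[File] inputs
        runtime {
          # autoSize(): hypothetical total size, in GB, of all localized inputs,
          # padded here with a 10% buffer
          diskGB: autoSize() + autoSize() * 0.1
        }
      }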

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @dlivitz, we don't currently have any functionality for calculating or setting disk size based on the inputs of a particular run, so the recommendation is to set a disk size per task based on the range of input sizes you expect that task to process. The available sizes are listed in Google's documentation here.

    We have noted @esalinas' request for an autosizing functionality but that is currently not on our list of priority features.

  • dlivitz Member, Broadie

    @Geraldine_VdAuwera , just to clarify

    Is the size() function, as implemented in the WDL spec, supported inside the runtime {} block in FireCloud or not?

    -Dimitri

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Oof, good question @dlivitz. I would assume not but will have the Cromwell team confirm or correct.

  • esalinas (Broad) Member, Broadie ✭✭✭
    edited June 2017

    @dlivitz @Geraldine_VdAuwera I have attempted to do so, but none of my attempts succeeded. This was one such attempt: http://gatkforums.broadinstitute.org/firecloud/discussion/9347/size-arithmetic-issue

  • esalinas (Broad) Member, Broadie ✭✭✭

    @Geraldine_VdAuwera would it be possible here to change the line "Int final_int = string_before_decimal" to "Int final_int = string_before_decimal+10" and have it work? That way, some extra memory can be added.

    or would one have to say (at the first line) Float f = size(infile)+10.0 and add the buffer earlier in the steps?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I think you should be able to do either, as those are both supported operations.
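
    Both placements, sketched (illustrative; note that string_before_decimal is a String, so adding 10 to it directly risks string concatenation rather than arithmetic in some WDL versions, and converting to Int first sidesteps that):

      # convert first, then add the buffer with integer arithmetic
      Int base_int = string_before_decimal
      Int final_int = base_int + 10

      # or add the buffer up front, while the value is still a Float
      Float f = size(infile) + 10.0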

  • gordon123 (Broad) Member, Broadie
    edited July 2017

    I got this working. The Float, etc. lines in Geraldine's example live inside the task block that surrounds the runtime block.

    A couple issues I ran into:

    • I wanted to add an extra 10 GB; this could not be represented as 10E9 but needed to be spelled out. I did the math on the floats.
    • I needed to leave the number as a string, to be merged in with the rest of the boilerplate for setting the disk size.
    • I needed to change the input files from optional to required parameters; when they were optional parameters I got cryptic error messages like:
      message: SIZE_WEXTUMOR causedBy: message: Could not find SIZE_WEXTUMOR in input section of call mutation_validator_workflow.mutation_validator message: fc-e5920d62-ed56-47ed-8a61-bf1cac042c69/cba84375-9f79-4383-9f0e-edcc21501079/mutation_validator_workflow/c7dbb751-440f-4845-9429-a37fcfe94a1b/call-mutation_validator/"gs:/5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/ACC/DNA/WXS/BI/ILLUMINA/TCGA_MC3.TCGA-OR-A5J1-01A-11D-A29I-10.bam"
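
    Put together, the first two points above might look like this (illustrative, draft-2 syntax; the file names are assumptions):

      Float raw_gb = size(tumor_bam, "G") + size(normal_bam, "G")
      Float padded_gb = raw_gb + 10.0       # the extra 10 GB, written out as a float literal
      String s = padded_gb
      String disk_gb = sub(s, "\\..*", "")  # left as a String for the disks boilerplate

      runtime {
        disks: "local-disk ${disk_gb} HDD"
      }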

    BTW - it would be great if there were a predefined variable that gives the total size of all file inputs; it seems common to want to size the disk to fit the input data, plus some multiplicative and/or additive factor.

    Gordon

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I like the idea of a function to get the total size of all inputs in one go. Will suggest this to the Cromwell team.

  • birger Member, Broadie, CGA-mod ✭✭✭

    @Geraldine_VdAuwera I'd like to emphasize the importance of an autosize() feature. We have many workflows that take an array of files as input. The disk size required to accommodate those files depends not only on the expected size of each file type, but also on the length of the array, which is driven by the size of the entity set the workflow is targeted at. We REALLY need Cromwell to be able to pass the cumulative size of all files that will be localized onto the VM into the runtime block, so the required disk size can be calculated. This request is a high priority for our analysts; please do whatever you can to get some movement on this feature request. Other than Dimitri's python script, which makes Google API calls for each of the inputs to query its size and writes an annotation with the sum, none of the above suggestions address the original problem of getting a disk size that can accommodate all of the files in an array. And as Dimitri observed, his workaround is "clunky".

  • birger Member, Broadie, CGA-mod ✭✭✭

    I also ran some tests. First, I found that expressions that obtain the sizes of files and do arithmetic on those sizes work at both the task level and the workflow level. It makes most sense, however, to implement them at the task level, given that the disk requirements of each task, as a function of the size of that task's input files, should be captured in the task-level WDL.

    Note how I use the "G" string parameter in my call to the size function; this means that file sizes are reported in gigabytes. I add the 1 so that the string representation of the float is in decimal rather than exponential notation (e.g., 2.6569E-5), which Geraldine's regex requires.

    Also, Cromwell 28 supports ceiling and floor operations on floats. When Cromwell 28 is integrated with FireCloud, we can drop Geraldine's hack and just use the ceiling operation.
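
    Once that lands, the whole Float-to-String-to-Int dance should collapse to a single line along these lines (an untested sketch, assuming Cromwell 28's ceil() function):

      Int intDiskSizeGB = ceil((size(inputFile1, "G") + size(inputFile2, "G") + 1) * 2.0)

    Rounding up is also safer than truncation for disk sizing, since truncating can under-provision by up to a gigabyte.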

    Here is my sample code:

     task test_task {
    
         File inputFile1
         File inputFile2
         Float floatDiskSizeGB=(size(inputFile1, "G") + size(inputFile2, "G") + 1) * 2.0
         String stringDiskSizeGB=sub(floatDiskSizeGB,"\\..*","")
         Int intDiskSizeGB = stringDiskSizeGB
    
         command <<<
              echo ${floatDiskSizeGB}         
              echo ${stringDiskSizeGB}
              echo ${intDiskSizeGB}
         >>>
    
         runtime {
             disks: "local-disk ${intDiskSizeGB} HDD"
             docker : "ubuntu:14.04"
         }
    }
    
    workflow test_workflow {
        File inputFile1
        File inputFile2
        call test_task {
            input:
               inputFile1=inputFile1,
               inputFile2=inputFile2
        }
    }
    

    None of this helps when the input to a task is an array of files; we really need Eddie's proposed "autosize" function for that.
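
    For reference, later revisions of the WDL spec allow size() to take an Array[File] directly and also define ceil(), so once those are supported the array case could look like this (a sketch under that assumption):

      Array[File] bams
      # size() over an array: the summed size of all member files, in GB
      Int disk_gb = ceil(size(bams, "G") * 2) + 10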
