Cannot read a TSV that's bigger than 128K bytes, but I can write it!

yfarjoun - Broad Institute, Dev ✭✭✭

I'm trying to create a joint call-set, and I have a large TSV I need to read for processing. I'm getting the following error:

> message: Workflow has invalid declarations: Could not evaluate workflow declarations: JointGenotyping.
> Use of WdlSingleFile(gs://fc-abad691f-3e3e-4ddc-85a6-e399521974bf/6f9f789f-746a-4fa7-bc08-30154d86c919/JointGenotyping/d3c4daff-5553-4917-8e4f-0579cd9a826f/call-RotateGVCF/output.tsv) failed because the file was too big (100362390 bytes when only files of up to 128000 bytes are permissible

The file was generated in a previous step by Cromwell, so it's quite annoying that it's impossible to read it back in.

Would it be possible to increase the limit? Alternatively, if one were able to iterate over the lines of that file, that would also be good.

In the absence of either option, I would have to do even more processing within the task instead of scattering with Cromwell.
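
For anyone hitting the same error, here is a minimal sketch (WDL 1.0 syntax, hypothetical names, not the actual JointGenotyping workflow) of the kind of workflow-level declaration that triggers the limit: the read_tsv call is evaluated by the Cromwell engine itself, which is where the 128,000-byte cap applies, whereas the task that wrote the file had no such restriction.

    version 1.0

    workflow ReadBigTsvSketch {
      input {
        # e.g. the output.tsv produced by the previous call-RotateGVCF step
        File sample_map
      }
      # This declaration is evaluated engine-side by Cromwell, so the
      # 128,000-byte limit on read_* functions applies here.
      Array[Array[String]] rows = read_tsv(sample_map)
      output {
        Int n_rows = length(rows)
      }
    }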

Answers

  • Ruchi - Member, Broadie, Moderator, Dev (admin)

    Hey @yossi, Cromwell added file-size limits for its read functions in release 27. This was introduced to keep Cromwell from attempting to read very large files and to reduce its memory load. Increasing the limit could end up slowing Cromwell down overall, so instead we can consider a feature that would help here, such as having your previous task output a file-of-file-names and scattering over that instead.

    As for a workaround, maybe you could have a separate task that uses wc -l to get the number of lines in the file, then build an array whose length equals that line count using WDL's range() function. You could then scatter over that array and have each shard extract the corresponding line of your input file by its index. Do you think you could try something like that? (A rough sketch of this pattern follows the issue link below.)

    Issue · Github, opened by Geraldine_VdAuwera
    Issue Number: 2383
    State: open
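
    A rough sketch of the pattern described above (a hedged illustration, not Ruchi's actual code; WDL 1.0 syntax with made-up task names): a CountLines task runs wc -l, the workflow builds an index array with range(), and each shard pulls out a single row of the TSV by index.

        version 1.0

        task CountLines {
          input {
            File tsv
          }
          command <<<
            wc -l < ~{tsv}
          >>>
          output {
            Int n = read_int(stdout())
          }
        }

        task ExtractRow {
          input {
            File tsv
            Int index
          }
          command <<<
            # range() is 0-based, sed line addresses are 1-based
            sed -n "$((~{index} + 1))p" ~{tsv}
          >>>
          output {
            # one row of the original TSV, still tab-separated
            String row = read_string(stdout())
          }
        }

        workflow ScatterOverBigTsv {
          input {
            File big_tsv
          }
          call CountLines { input: tsv = big_tsv }
          scatter (i in range(CountLines.n)) {
            call ExtractRow { input: tsv = big_tsv, index = i }
          }
          output {
            Array[String] rows = ExtractRow.row
          }
        }

    Note that the engine-side read_* calls here only ever touch the wc -l output and single rows of the TSV, keeping them under the 128K limit; the cost is that the full 100 MB TSV is localized into every shard, which is the downside yfarjoun raises below.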
  • Geraldine_VdAuwera - Cambridge, MA; Member, Administrator, Broadie (admin)

    @yossi I'm negotiating a size limit increase -- are you running this on FireCloud or the internal methods Cromwell?

  • Geraldine_VdAuwera - Cambridge, MA; Member, Administrator, Broadie (admin)

    Actually that file size is a bit insane -- @yossi, is this a file-of-file-names or something different?

  • yfarjoun - Broad Institute, Dev ✭✭✭

    yes...a fofn...:-/

    OK. I'll implement a workaround. These limits should be posted in the documentation...

    @Ruchi I can do this, but the problem is that it will require delocalizing the file line-count times... but it is the simple solution. The more complex solution is to split the file into line-count files, put the names of those files into a fofn, and then return that fofn.
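
    A hedged sketch of that second route (WDL 1.0 syntax; the task name and chunk layout are illustrative): the splitting task writes each row of the big TSV into its own small file, one path per line, and the workflow scatters over those chunk files directly as an Array[File] via glob(). Returning a fofn of the chunk paths written inside the task would point at container-local paths, which is awkward to consume downstream, so glob() is used here as a simpler swap.

        version 1.0

        task SplitTsv {
          input {
            File big_tsv
          }
          command <<<
            mkdir chunks
            # one output file per TSV row, with that row's tab-separated
            # paths rewritten one per line for read_lines() downstream
            awk -F'\t' '{
              out = sprintf("chunks/chunk_%05d.txt", NR)
              for (i = 1; i <= NF; i++) print $i > out
              close(out)
            }' ~{big_tsv}
          >>>
          output {
            Array[File] chunks = glob("chunks/*")
          }
        }

        workflow SplitAndScatter {
          input {
            File big_tsv
          }
          call SplitTsv { input: big_tsv = big_tsv }
          scatter (chunk in SplitTsv.chunks) {
            # each chunk holds one interval's worth of paths (~500 of them),
            # so this read_lines stays under the 128K limit on average
            Array[String] gvcf_paths = read_lines(chunk)
          }
          output {
            Array[Array[String]] all_paths = gvcf_paths
          }
        }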

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Holy carp that is a big fofn. How many filenames does it contain?

    The Cromwell team is looking into solutions to handle fofns more appropriately and avoid all this unpleasantness. Won't be for a while though (next quarter I would guess).

    And yes this limitation will be documented shortly -- it's in the Cromwell docs but we (on the WDL side) had not realized it was a thing. @ChrisL wrote a really good doc about this that I plan to poach.

  • yfarjoun - Broad Institute, Dev ✭✭✭

    It should only have 901 (intervals) * 500 (samples) files... that's ~450,000... multiply that by the length of Google bucket paths... and you see why I have a 100 MB file.

  • Ruchi - Member, Broadie, Moderator, Dev (admin)

    @yfarjoun, based on the number of paths in this TSV, the average file path is about 222.78 bytes. So if each row of the TSV were written to its own file, with 500 paths per file, each file in the array would be about 111,390 bytes on average. I'm assuming each of these files will eventually need a read_lines downstream to convert it from a file to an array of files. Since the estimated file size is under 128K bytes, you should theoretically be able to read them, but it's possible certain shards will be over the limit and fail the workflow. While you have the option to choose between the two solutions, I'd recommend doing the naive thing so that you don't run into this limitation again. I totally understand it's incredibly redundant to pass the same file to 901 shards only to use a fraction of its contents, but this TSV should be quick to localize and shouldn't take up a significant portion of disk space.
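
    For reference, the arithmetic behind that estimate, using the numbers from the error message and from yfarjoun's count: 100,362,390 bytes / (901 * 500 = 450,500 paths) ≈ 222.78 bytes per path, and 222.78 * 500 paths per chunk ≈ 111,390 bytes per chunk, comfortably under the 128,000-byte limit on average, though individual shards with longer paths could still exceed it.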
