Tabix-incompatible file creation times

ericco92 (Cambridge, UK; Member)

I have a workflow that indexes a vcf.gz to produce a tabix index (tbi). A common sanity check when reading from a VCF index is to verify that the index was created more recently than the VCF.

It seems like gsutil sets file creation times on copy, rather than when they're actually created by my workflow.

Since the index is often around 100x smaller than the VCF, it almost always finishes copying first and so ends up with the earlier timestamp. Here are the file sizes and creation times for my two files in the GS bucket:
VCF: 150789555 2018-06-05T13:42:39Z
VCF index: 1616957 2018-06-05T13:42:34Z

When I copy the index to my local machine, the creation time and sizes are:
VCF index: 1.6M Jun 7 14:02

Is there a way to ensure the creation times actually reflect when the file was created, not copied?

As a hack, I'd be willing to run some sort of gsutil touch command to reset the file times, but I don't see how that might work. Perhaps gsutil setmeta?
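For illustration, the sanity check described in this question can be sketched in Python as below. This is a minimal sketch, assuming the tool compares filesystem modification times of the two local files; the exact comparison htslib performs is not stated in the thread, and `index_is_stale` is a hypothetical helper name. The two timestamps are the ones from the bucket listing above.

```python
import os
from datetime import datetime


def index_is_stale(data_path, index_path):
    """Return True if the index's modification time is older than the data file's.

    Hypothetical helper: assumes the check is a plain mtime comparison.
    """
    return os.stat(index_path).st_mtime < os.stat(data_path).st_mtime


# Timestamps reported for the two objects in the GS bucket listing.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
vcf_time = datetime.strptime("2018-06-05T13:42:39Z", TIME_FORMAT)
tbi_time = datetime.strptime("2018-06-05T13:42:34Z", TIME_FORMAT)

# The index finished copying 5 seconds before the VCF, so it looks older.
lag_seconds = (vcf_time - tbi_time).total_seconds()
```

Under that assumption, any pair of files copied in this order would trip the check, even though the index was genuinely built after the VCF.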

Answers

  • KateN (Cambridge, MA; Member, Broadie, Moderator admin)

    Hi @ericco92, thank you for your patience. I'm looking into possible solutions for you at the moment.

  • KateN (Cambridge, MA; Member, Broadie, Moderator admin)

    After speaking with the developers, there isn't currently a way to keep the original file creation times in Cromwell. If possible, I would recommend ignoring that warning message, or disabling the check if the tool you use will not run unless the file times are correct.

    I can put in a feature request to have this implemented if you'd like. From discussion with the team, though, this type of feature would be very complicated to implement. As such, it wouldn't be done any time soon.

    Please let me know if you'd like me to put in a feature request, or if you're able to use one of the other workarounds suggested (ignore, disable, or a gsutil command).

  • ericco92 (Cambridge, UK; Member)

    Hi Kate,

    As a hack I just copied the file I needed to a VM and copied it back to reset the times. It's not terribly expensive - I can usually use preemptibles - but it feels a little gross.

    There's no way to really adapt the tools I'm using to handle indices that are older than files, as it's a "feature" of the htslib core.

    Would it be possible to have runtime attributes that determine the order in which files are copied? I assume they get queued up somehow, but gsutil might be handling this.
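A lighter-weight variant of the copy-and-copy-back hack is to bump the index's modification time once both files are on local disk, for example at the start of the task that reads them. The sketch below is only an illustration: `freshen_index` is a hypothetical helper, it assumes the downstream tool compares local modification times, and it changes nothing about the object creation times recorded in the bucket.

```python
import os


def freshen_index(data_path, index_path):
    """Ensure the index's mtime is newer than the data file's.

    Hypothetical helper. Returns True if the index timestamp was changed,
    False if the index was already at least as new as the data file.
    """
    data_mtime = os.stat(data_path).st_mtime
    index_stat = os.stat(index_path)
    if index_stat.st_mtime >= data_mtime:
        return False
    # Keep the access time; set the mtime one second after the data file's.
    os.utime(index_path, (index_stat.st_atime, data_mtime + 1))
    return True
```

Calling `freshen_index("sample.vcf.gz", "sample.vcf.gz.tbi")` (placeholder file names) before invoking the tool has the same effect as a local `touch` on the index, without a round trip through a VM.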
