
Frequently Asked Questions

Ilyana_Rosenberg Member, Broadie, Moderator admin
edited March 2018 in Firehose Migration

1. When is Firehose being decommissioned?

On July 1st, the Firehose system will be switched to “read-only mode”. Existing data can still be read, but no new analyses can be launched. Two weeks later, Firehose will be fully shut down and the storage hardware supporting it will be taken offline.

2. Why is Firehose being decommissioned?

Nearly all of the hardware underlying the Firehose system is formally off warranty, old enough that it’s starting to fail, and would cost millions of dollars to replace. Such an expenditure on local infrastructure cannot be justified given that we now have an operational cloud-based system for pipeline execution (FireCloud).

3. What data is where in Firehose / on-prem storage?

For data that was sequenced at the Broad, there are multiple copies of the sequence data:

  1. Every “read group” (i.e. lane of sequencing for a sample) is stored in /seq/picard/
    *This is paid for by GP and is NOT generally meant to be used in analysis
  2. All of the read group data for a given sample is aggregated in /seq/picard_aggregation/
    *This is also paid for by GP and is meant to be used in analysis
  3. In some cases, groups may have reprocessed the data and are storing yet another copy of it in their own Firehose workspaces
    *This is fully being paid for by those groups through their subscription

In addition, all of the outputs from the various downstream analyses done on the sequence data are stored in a group’s Firehose workspace.

Data not sequenced at the Broad is handled just like any other data that a group stores (i.e. the group has a separate storage subscription that it pays BITS for), completely separate from Firehose.

There is also a whole bunch of data in the Firehose Archive (see below).

4. What data is at risk of being lost due to the Firehose decommission?

It is just the storage backing Firehose itself that will be disappearing. The data from /seq/picard/ and /seq/picard_aggregation/ are not (yet) being decommissioned.

Note that the Firehose archive WILL be decommissioned with the rest of Firehose.

5. How will the PIs get their work done if they can no longer use Firehose?

We recommend that they migrate their analysis (and their data) to FireCloud, which is a Google-cloud-backed platform for large scale analysis developed by the DSP.

It is important to understand that FireCloud is not a one-for-one substitute for Firehose. The data model is different and migrating your methods to FireCloud involves modifying them to work in a way that's more native to FireCloud and also sets you up for long-term productivity in that model.

6. What are the options available to the PIs over the next few months?

At some point in 2018, Kathleen’s operations team will be migrating all of the cancer data in /seq/picard/ to the cloud and fully reprocessing it there against the newest version of the human reference genome. Therefore, for any projects which are not time sensitive and which don’t need to maintain older versions of the various files, the PIs may opt to do nothing for now. Note that when this data is eventually moved to the cloud, GP will pay for its storage.

For any projects that want to move the aggregated data (from /seq/picard_aggregation/) now, BITS will help. The PI should work with Eric Jones to get the list of sequence data to be moved, and his team will copy it into a FireCloud workspace that GP pays for. Note that 1) the PI will have read-only rights to that data, 2) the existing copy in /seq/picard_aggregation/ will then be deleted, and 3) Kathleen’s team will still move and reprocess that data (from /seq/picard/) at some point in the future.

For any projects that want to move analysis data (i.e. not sequence data) from their Firehose workspace to FireCloud, we will help them do it. Most likely this help will be coming from the FireCloud field engineers.

7. Should the PIs actually be moving ALL of their data to the cloud?

Ultimately, we really hope that the answer is NO. The previous paradigm of maintaining all older versions of files just isn’t sustainable as the amount of data keeps increasing. In the end, this is the call of each individual PI since s/he will be the one paying for all the old copies, but we really want to impress upon them that it’s incredibly wasteful. The new paradigm of using docker images and WDLs means that we can always reproduce the older datasets if needed without actually having to store them. That’s the model we want everyone to move to.

8. How do people get stuff out of the Firehose Archive?

The Firehose archive talks to BOSS to get a pre-signed URL and uploads a tar of the job output directory there. The object name is the job_execution_id in Firehose. In other words, a Firehose object in BOSS is a tar file that can be expanded to recover the Firehose job output directory, which is exactly what the unarchive step does. The script for archiving is here and the script for unarchiving is here.

Getting data out of the archive is itself a Firehose job. A user can invoke it from the UI (job detail page) or through an API call.
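
To make the tar-per-job idea concrete, here is a minimal local sketch in Python. It is not the actual Firehose archive or unarchive script: the function names are hypothetical, the tar is written to a local directory rather than uploaded to BOSS through a pre-signed URL, and the job id "12345" is made up. It only illustrates packing a job output directory into a tar named after its job_execution_id and expanding it again.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_job_output(output_dir: Path, job_execution_id: str, store_dir: Path) -> Path:
    """Pack a job output directory into a tar named after its job execution id.

    Hypothetical helper: the real archive uploads the tar to BOSS instead of
    writing it into a local store directory.
    """
    tar_path = store_dir / job_execution_id
    with tarfile.open(tar_path, "w") as tar:
        # arcname="." stores member paths relative to the output directory
        tar.add(output_dir, arcname=".")
    return tar_path


def unarchive_job_output(tar_path: Path, restore_dir: Path) -> None:
    """Expand the tar to recover the job output directory (hypothetical helper)."""
    restore_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, "r") as tar:
        tar.extractall(restore_dir)


def demo() -> str:
    """Round-trip a tiny fake job output directory and return the restored content."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        output_dir = root / "job_output"
        output_dir.mkdir()
        (output_dir / "result.txt").write_text("hello")

        store_dir = root / "store"
        store_dir.mkdir()
        tar_path = archive_job_output(output_dir, "12345", store_dir)

        restored = root / "restored"
        unarchive_job_output(tar_path, restored)
        return (restored / "result.txt").read_text()


if __name__ == "__main__":
    print(demo())
```

In the real system the unarchive step runs as a Firehose job, so you would normally trigger it from the job detail page or the API rather than expanding the tar yourself.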

9. Not all of the tools/pipelines that were available to me in Firehose are accessible in FireCloud -- what should I do about that?

Several groups are working hard on migrating the pipelines into FireCloud (especially groups such as the Getz, Van Allen, and Beroukhim labs). And we will do our best to publicize how to find and use the pipelines that have already been migrated and vetted (more on that in the future). But, yes, it is possible that you will need to chip in and help migrate specific pipelines that your group relies on. Also of note: we are in the process of writing a new tool named Hydrant to help simplify this migration process, and we anticipate having a beta version available well before Firehose is turned off. In any case, the FireCloud Field Engineering team will happily work with you to assist however they can.

10. Can I transfer a workspace from Firehose to FireCloud?

Workspaces contain four types of information: summary information, data model, method configurations, and analysis history. While no tool exists that will do an automated transfer of an entire workspace, you can manually export the data model and import it into a new FireCloud workspace and point to the methods/pipelines you wish to use in the FireCloud methods repository. The analysis history unfortunately cannot be transferred to the new FireCloud workspace. The FireCloud Field Engineering team can help you in transferring any workspaces you wish to bring to FireCloud.
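
As a rough illustration of the manual data model step, the sketch below writes entity rows into a tab-separated load file whose first column header has the form entity:<type>_id, which is the layout FireCloud expects for entity imports. Everything else is an assumption: the function name is hypothetical, the input is assumed to already be a list of Python dictionaries (the format of a Firehose export is not described here), and the sample, participant, and bucket path are invented.

```python
import csv
import io


def to_firecloud_tsv(entity_type: str, rows: list[dict[str, str]], id_key: str) -> str:
    """Render entity rows as a TSV with an 'entity:<type>_id' first column.

    Hypothetical helper; `rows` stands in for whatever you exported from
    the Firehose data model, and `id_key` names the field holding the entity id.
    """
    # Collect every attribute name except the id, in a stable order.
    attribute_names = sorted({key for row in rows for key in row if key != id_key})

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([f"entity:{entity_type}_id", *attribute_names])
    for row in rows:
        # Missing attributes are written as empty cells.
        writer.writerow([row[id_key], *(row.get(name, "") for name in attribute_names)])
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented example row; replace with your own exported entities.
    example_rows = [
        {"sample_id": "S1", "participant": "P1", "bam": "gs://example-bucket/S1.bam"},
    ]
    print(to_firecloud_tsv("sample", example_rows, id_key="sample_id"), end="")
```

The resulting file could then be uploaded through the FireCloud workspace data import. Method configurations still have to be pointed at the FireCloud methods repository separately, and the analysis history is not carried over.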

11. Who can I contact if I have additional questions?

You may post to this forum or reach out to [email protected] with specific questions and to get started migrating off of Firehose.
