Jupyter notebooks integration: bringing interactive analysis to FireCloud
FireCloud has so far focused on making it easy to share data and execute pipelines, for all your batch-style (one might even say boring) data processing and analysis needs. That covers a lot of territory; certainly most of the upstream work that is done under the umbrella of "genomic analysis". But for many of you, the end of the pipeline is just the start of the really interesting part; and whether you're moving on to GWAS or something else, the point is that you're moving on -- specifically, to a phase of analysis where you need to be able to interrogate your data interactively. Until now, that often meant downloading your pipeline's outputs and getting back to traditional, on-premises computational work and its limitations.
So today, we're excited to release a beta preview of a new FireCloud feature called Notebooks that makes it possible to run interactive analyses on your data in the cloud, with the convenience of a Jupyter notebook environment.
If you're not already familiar with them, Jupyter notebooks (formerly IPython notebooks) have become increasingly popular as a tool for working with data interactively in a way that captures the narrative of the analysis. This is because they allow you to both execute code and embed text that describes what the code is doing, in a way that goes well beyond mere documentation comments. The main goal is to emulate a laboratory notebook as used in a wet lab, where the experimenter writes down the details of every step of every procedure (including, importantly, all those that failed) in order to enable reproducibility.
Using a Jupyter notebook in FireCloud involves two things: the notebook itself, and a Spark cluster. The notebook is ultimately just a text file in which you can embed sections of code, as well as pointers to data files that are stored on the cloud. The Spark cluster is a cluster of virtual machines that FireCloud will create for you on demand, on which you will "launch" your notebook -- and that's where any code you run in your notebook will actually get executed. We use Spark clusters because they offer massive scalability, especially in combination with toolkits like Hail that are designed to take advantage of Spark's extraordinary capacity for parallelism. That being said, you don't need to know anything about Spark in order to use this feature, and if your needs are modest, you can set it to use just a single VM.
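To make the "just a text file" point concrete, here is a small sketch showing what a minimal notebook file looks like on disk. It uses only the Python standard library; the file name and cell contents are invented for illustration, and the structure follows the standard Jupyter notebook JSON format (a `cells` list mixing markdown narrative with executable code):

```python
import json

# A minimal notebook, built by hand to show that an .ipynb file is just
# structured text: JSON containing a list of cells. (File name and cell
# contents here are hypothetical examples.)
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {
            # A markdown cell holds the narrative text that lives alongside the code.
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# QC summary\n", "Describe what the next cell does and why."],
        },
        {
            # A code cell holds executable code; outputs are stored in the file too.
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello from a notebook cell')"],
        },
    ],
}

# Serialize to disk; this is essentially what Jupyter stores when you save.
with open("example.ipynb", "w") as f:
    json.dump(notebook, f, indent=1)

# Reading it back shows it is plain text you can inspect, diff, or version-control.
with open("example.ipynb") as f:
    loaded = json.load(f)

print(len(loaded["cells"]))
```

Because the file is plain JSON, it can be stored in a workspace bucket, copied between workspaces, or checked into source control like any other text file.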
To get started with Notebooks, check out this overview, then head over to this tutorial, which will give you step-by-step instructions and links to example notebook files. Keep in mind that this is still a beta release, so there are some limitations, which are detailed here. We're actively working on removing those limitations and further improving the Notebooks feature, so if you give it a try, please tell us how it goes and what we can do to make it better!