(howto) Set up, use, and shut down a Notebook
This feature is currently released in a Beta status to give you a chance to try it. Tell us what you like or struggle with on the forum.
- Upload a notebook to the workspace
- Create a cluster
- Work in your notebook
- Save your work and pause or delete your cluster
- Tips for long-running computations
- Jupyter terminal
1. Upload a notebook
Upload a notebook by using the Upload Notebook button in the Notebooks tab. This will open a file browser:
For this example, we will select the LeonarDemo.ipynb notebook. Other example notebooks can be found in the Leonardo GitHub repository.
This will upload the notebook to your workspace bucket. You can also create a brand new notebook using the Create Notebook button.
2. Create a cluster
Once you upload a notebook, you will see it displayed in the table like this:
You can Rename/Duplicate/Delete notebook files by expanding the menu to the left of the notebook name.
The next step is to create a cluster with which you can open and run the notebook. Select Create... and you will be presented with a dialog window where you can configure the Google Dataproc cluster:
- Name -- a unique name for your cluster
- Extension URI -- Optional bucket URI to an archive containing Jupyter notebook extension files. This allows you to customize the look and feel of a Jupyter notebook. The archive must be in tar.gz format, must not include a parent directory, and must have an entry point named main. For more information on notebook extensions, see Jupyter documentation.
- Master Machine Type -- Google machine type of the Spark master node.
- Master Disk Size -- disk size in GB of the Spark master node. The minimum value is 100.
- Workers -- number of Spark workers. Set to 0 for a single-node cluster, or to 2 or more otherwise. You cannot have exactly 1 worker; this is a restriction on the Google side.
- Worker Local SSDs -- the number of local SSDs to attach to each worker in a cluster.
- Preemptible Workers -- the number of preemptible worker nodes in a cluster. You must have at least 2 non-preemptible workers if you want preemptibles as well.
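The constraints in the list above can be checked before you submit the dialog. The following is a minimal sketch, not part of FireCloud itself: the function name and its parameters are hypothetical and simply mirror the fields described above.

```python
def validate_cluster_config(master_disk_gb, workers, preemptible_workers=0):
    """Return a list of problems with the requested settings (empty if none).

    Hypothetical helper: it only encodes the rules stated in the field list
    (disk >= 100 GB, workers 0 or >= 2, preemptibles need >= 2 workers).
    """
    problems = []
    if master_disk_gb < 100:
        problems.append("Master Disk Size must be at least 100 GB")
    if workers < 0 or workers == 1:
        problems.append("Workers must be 0 or at least 2")
    if preemptible_workers < 0:
        problems.append("Preemptible Workers cannot be negative")
    if preemptible_workers > 0 and workers < 2:
        problems.append(
            "Preemptible Workers require at least 2 non-preemptible workers"
        )
    return problems


# A single-node cluster with the minimum disk size passes every check.
print(validate_cluster_config(master_disk_gb=100, workers=0))
# A small disk, exactly 1 worker, and preemptibles without enough workers all fail.
print(validate_cluster_config(master_disk_gb=50, workers=1, preemptible_workers=3))
```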
Once you click Create... you'll be brought back to the notebook list page, where you should see your cluster spinning with Creating status:
You can view details about the cluster by clicking the gear icon and selecting Cluster Details. You can associate the notebook with a different cluster by selecting Choose.
Cluster creation typically takes about 3-5 minutes. Once it's ready, it will be displayed with Running status and the cluster name will turn into a link:
If an error occurs, the cluster will be displayed with Error status and there will be a link to a Google Cloud Storage (GCS) path where you can obtain logs.
3. Work in your notebook
Click on the cluster link. This will open the Jupyter Notebook in a new tab:
Start writing code! Jupyter gives you a cell-based coding environment and allows mixing code and markdown. We support Python 2, Python 3, and R kernels. Additionally, we support PySpark 2 and PySpark 3 kernels which can be used to run Hail 0.1 and Hail 0.2, respectively. We pre-install a number of useful data science/bioinformatics libraries, including bxpython on the Python side and the tidyverse family on the R side.
If you wish to install something else, you can easily do so from a notebook:
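As one possible approach for the Python kernels, you can call pip from a notebook cell. This is a sketch under assumptions: it presumes pip is available to the kernel's Python interpreter, the helper names are hypothetical, and the package name in the commented-out call is a placeholder.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command that targets the Python running this kernel."""
    # Using sys.executable avoids installing into a different Python
    # than the one the notebook kernel is actually using.
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run the pip command and raise CalledProcessError if it fails."""
    subprocess.check_call(pip_install_command(package))


# Uncomment to install; "some-package" is a placeholder name.
# install("some-package")
```

Packages installed this way live on the cluster, so they would need to be reinstalled on a newly created cluster.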
Custom Script URI
You can also specify a script that runs at cluster creation time as another way to customize the notebook environment. When creating the cluster, click “Optional Settings” and set the “Custom Script URI” field to the Google bucket path of the script (for example, gs://gatk-tutorials/scripts/install_gatk.sh).
Using FISS API
Note: FireCloud notebooks use Hail version 0.1. You can use Hail on a PySpark 2 notebook in FireCloud, not Python 2 as the screenshot displays. This screenshot gives you an idea of how you can read your data from a workspace bucket and run basic Hail commands.
4. Save your work and pause or delete your cluster
Notebooks will auto-save every two minutes, and you can also save by clicking File -> Checkpoint and Save (or Ctrl-S). A saved notebook contains code cells and results, including images. When a notebook is saved in the Jupyter user interface, it is automatically saved back to your workspace bucket. There is no need to back up copies of your notebook to your computer (although you can if you wish). Currently, only notebook (.ipynb) files are saved to the bucket; any other files will stay local to the cluster.
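Because other files stay on the cluster, you may want to copy them to a bucket yourself before deleting the cluster. The sketch below assumes the gsutil command-line tool is available on the cluster and that you have write access to the bucket; the helper names, bucket name, and paths are hypothetical.

```python
import subprocess


def gsutil_copy_command(local_path, bucket, dest_name):
    """Build a `gsutil cp` command that copies one local file into a bucket."""
    return ["gsutil", "cp", local_path, "gs://{}/{}".format(bucket, dest_name)]


def copy_to_bucket(local_path, bucket, dest_name):
    """Run the copy and raise CalledProcessError if gsutil reports a failure."""
    subprocess.check_call(gsutil_copy_command(local_path, bucket, dest_name))


# Uncomment and substitute your own file and bucket (placeholders shown).
# copy_to_bucket("results.tsv", "my-workspace-bucket", "output/results.tsv")
```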
You'll be paying for your cluster as long as it's running, even if you're not doing anything with it, and the costs will add up, so don't leave your cluster running for long periods of time! You can pause and resume your clusters by clicking the pause and play icons in the FireCloud Notebooks tab. When your cluster is paused, you pay only for storing the persistent disk in Google Cloud Storage, which is on the order of single-digit dollars per month for a default disk of 100 GB. (A larger disk would incur higher storage costs.) Resuming your cluster only takes a couple of seconds, so it is a bit faster than deleting your cluster and starting from scratch. Overall, pause and resume is a nice option if you're willing to pay a few pennies to keep your progress when you leave work for the night and know you'll be back at it the next day.
When you're done working for a while and you know you won't come back to your notebook for a few days or more (or ever!), the best thing to do is delete the cluster. It'll be easy enough to spin up a new cluster when you're ready to get back to work. You can delete your cluster in the FireCloud notebooks tab by clicking the trash icon next to your cluster.
To help save costs, FireCloud will automatically pause any clusters that are idle for longer than 30 minutes, unless otherwise specified at creation time. Just as when you pause the cluster manually, auto-pausing retains all data stored on the cluster, and there is a very small fee associated with the data persistence (see previous section). This helps protect against incurring costs in case you forget to pause or delete a cluster when you’re done using it. If you wish, you can modify or disable the auto-pause threshold at cluster creation time. Note that if a notebook is open in a browser tab (even if it's not in a tab you are actively looking at or it's in the background), then the notebook is not considered idle.
6. Tips for long-running computations
Jupyter doesn’t support running notebooks in the background. When you kick off a long-running cell in a notebook, it will keep running as long as the notebook is open in a browser tab. However, if you close the tab (or your laptop goes to sleep), the task will continue running on the cluster, but it won't report the result back to the notebook cell. Because of this, a best practice for long-running tasks is to write the result to a file, for instance on the local filesystem of the master node, in the Hadoop Distributed File System (HDFS), or in GCS. That way you should still be able to retrieve the results if you lose connectivity to your notebook.
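The write-to-file practice described above might look like the following sketch. It targets the local filesystem only, the function names and output location are hypothetical, and `slow_sum` is a stand-in for a real long-running computation.

```python
import json
import os
import tempfile


def run_and_save(task, output_path):
    """Run task() and write its JSON-serializable result to output_path."""
    result = task()
    with open(output_path, "w") as handle:
        json.dump(result, handle)
    return output_path


def slow_sum():
    """Stand-in for a long-running computation."""
    return {"total": sum(range(1000))}


# The result lands on disk even if the browser tab is closed before the
# cell finishes, so it can be read back later from a new cell or session.
output_file = os.path.join(tempfile.gettempdir(), "slow_sum_result.json")
run_and_save(slow_sum, output_file)

with open(output_file) as handle:
    print(json.load(handle))
```

Keep in mind that a file on the cluster's local filesystem is lost when the cluster is deleted, so results you want to keep long term belong in a bucket.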
It is also advisable to disable or increase the auto-pause threshold if you need to run long-running tasks. Pausing a cluster will kill any computations in progress. The auto-pause threshold is currently configurable at cluster creation time. In the future we plan to make this configurable without needing to recreate your cluster.
7. Jupyter Terminal
In addition to notebooks, Jupyter also provides an in-browser bash terminal. This can be useful in general for running bash commands/scripts and exploring data. To access it, click the Jupyter logo in the upper-left of your notebook:
This will bring you to the Jupyter tree view. Then click New -> Terminal to access the terminal:
Note, however, that nothing you type in this terminal will be captured in the notebook.