Use Cases

We have noticed common patterns in our engagements with scientific users who rely on Jupyter for their computational workflows on NERSC systems. At the highest level, there is a need to combine exploration of very large datasets with computational and analytical capabilities. Crucially, the scale of data or compute (or both) required for these workflows typically exceeds the capacity of the users' own machines, and users need a friendly way to drive these large-scale workflows interactively.
We often see a two-phase approach, where the user develops notebooks locally and then runs them on machines like those at NERSC against their production data and compute pipelines. It is important to be able to move seamlessly between these modes; our approach is grounded in ensuring that a user can bring a notebook and its associated environment over to our systems with minimal effort and have a consistent user experience.
As an example, we describe a use case [ref: Heagy et al.] applying geophysical simulations and inversions to image the subsurface. This was done by running 1000 1D inversions, each of which produces a layered model of the subsurface conductivity; these models are then stitched together to create a 3D model. The goal of this particular survey was to understand why the Murray River in Australia was becoming more saline. The work involved running simulations, data analysis, and machine learning (ML) on HPC systems, and the outputs of these runs need to be visualized and queried interactively. The initial workflow was developed in the user's local laptop environment and needed to be scaled up at NERSC.
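To make the stitched-inversion idea concrete, the following is a minimal sketch, assuming a regular grid of sounding locations and a toy stand-in for the 1D inversion (the real survey geometry and SimPEG inversion are considerably more involved): each location yields a layered conductivity profile, and the profiles are assembled into a 3D volume.

```python
# Illustrative sketch only: a toy "stitched" inversion. The inversion here is
# a random placeholder, not the actual SimPEG computation.
import numpy as np

n_x, n_y, n_layers = 40, 25, 30   # illustrative: 40 x 25 = 1000 sounding locations, 30 layers each

def invert_1d(ix, iy):
    """Stand-in for a single 1D inversion returning layer conductivities (S/m)."""
    rng = np.random.default_rng(ix * n_y + iy)
    return rng.uniform(0.01, 1.0, size=n_layers)

# Run one 1D inversion per sounding location and stitch the layered
# models together into a 3D conductivity volume.
conductivity_3d = np.empty((n_x, n_y, n_layers))
for ix in range(n_x):
    for iy in range(n_y):
        conductivity_3d[ix, iy, :] = invert_1d(ix, iy)
```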
In practice, this involves running Jupyter at NERSC in a Docker container with a pre-defined, reproducible software environment. Parallel computing workers are launched on Cori from Jupyter using dask-jobqueue, and workers can be scaled up or down on demand. The SimPEG inversion notebook farms out parallel tasks to Dask; the results of these parallel runs are pulled back into the notebook and visualized. A large batch of simulations is then run to generate data for a machine learning application.
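The sketch below illustrates this pattern, assuming a Slurm-managed system and dask-jobqueue; the queue name, resource requests, and the placeholder inversion function are illustrative assumptions, not the configuration or SimPEG code used in the actual workflow.

```python
# Minimal sketch: launch Dask workers from a notebook via dask-jobqueue on a
# Slurm system and farm out independent 1D inversions.
import numpy as np
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="regular",       # hypothetical Slurm partition
    cores=32,              # cores per batch job
    memory="64GB",         # memory per batch job
    walltime="00:30:00",
)
cluster.scale(jobs=10)     # scale workers up or down on demand
client = Client(cluster)

def run_inversion(sounding_id):
    """Placeholder for one 1D inversion; returns a layered conductivity model."""
    rng = np.random.default_rng(sounding_id)
    return rng.uniform(0.01, 1.0, size=30)

futures = client.map(run_inversion, range(1000))   # one task per sounding
models = client.gather(futures)                    # pull results back into the notebook
```

Because the 1000 inversions are independent, the notebook can treat the batch as an embarrassingly parallel map, so resizing the Dask cluster changes throughput without changing the analysis code.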