Abstract
In this presentation, we will describe the [Pangeo
Project](http://pangeo.io), a coordinated community effort with
support from NASA, NSF, AWS, Microsoft Azure, and Google Cloud, to
develop interactive and reproducible open source workflows for
discovery, visualization, and quantitative analysis of large datasets
used for research in the Earth Sciences. The Pangeo computational
platform is based on JupyterHub and deployed wherever the data is
stored. Python libraries such as Xarray, Rasterio, and Dask enable
distributed parallel computations on HPC and Kubernetes clusters. We
will discuss the design concepts central to the Pangeo platform and
highlight specific applications using NASA satellite data archives on
AWS. We will also review recent progress in integrating data
discovery tools (e.g., STAC, CMR, Intake) with cloud-native storage
formats for multidimensional data (Cloud-Optimized GeoTIFF, Zarr,
etc.) and show how they can be combined to construct elegant, robust,
and reproducible scientific workflows. Finally, we will discuss
performance, security, transferability across public cloud platforms,
operating costs, and approaches to encouraging a cultural shift in
scientific computation through educational events.