The Pangeo Platform: a community-driven open-source big data environment

Joseph Hamman; Scott Henderson; Anthony Arendt; Amanda Tan; Dennis Fatland; Andrew Pawloski; Daniel Pilone; Matthew Hanson; Tom Augspurger; Ryan Abernathey; Richard Signell

doi:10.1002/essoar.10501751.1

Abstract

In this presentation, we will describe the [Pangeo Project](http://pangeo.io), a coordinated community effort, supported by NASA, NSF, AWS, Microsoft Azure, and Google Cloud, to develop interactive and reproducible open-source workflows for the discovery, visualization, and quantitative analysis of large datasets used in Earth science research. The Pangeo computational platform is based on JupyterHub and is deployed wherever the data are stored. Python libraries such as Xarray, Rasterio, and Dask enable distributed parallel computation on HPC and Kubernetes clusters. We will discuss the design concepts central to the Pangeo platform and highlight specific applications that use NASA satellite data archives on AWS. We will also review recent progress in integrating data discovery tools (e.g., STAC, CMR, Intake) with cloud-native storage formats for multidimensional data (Cloud-Optimized GeoTIFF, Zarr, etc.) and show how they can be used to construct elegant, robust, and reproducible scientific workflows. Finally, we will discuss performance, security, transferability across public cloud platforms, operating cost, and approaches to encouraging a cultural shift in scientific computing through educational events.