Cloud-Native Repositories for Big Scientific Data
  • Ryan Abernathey,
  • Tom Augspurger,
  • Anderson Banihirwe,
  • Charles C Blackmon-Luca,
  • Timothy J Crone,
  • Chelle L Gentemann,
  • Joseph J Hamman,
  • Naomi Henderson,
  • Chiara Lepore,
  • Theo A Mccaie,
  • Niall H Robinson,
  • Richard P Signell
Corresponding Author: Ryan Abernathey (rpa@ldeo.columbia.edu)


Abstract

Scientific data has traditionally been distributed via downloads from a data server to a local computer. This way of working suffers from limitations as scientific datasets grow towards the petabyte scale. A "cloud-native data repository," as defined in this paper, offers several advantages over traditional data repositories: performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access & inclusion. These objectives motivate a set of best practices for cloud-native data repositories: analysis-ready data, cloud-optimized (ARCO) formats, and loose coupling with data-proximate computing. The Pangeo Project has developed a prototype implementation of these principles using open-source scientific Python tools. By providing an ARCO data catalog together with on-demand, scalable distributed computing, Pangeo enables users to process big data at rates exceeding 10 GB/s. Several challenges must be resolved to realize cloud computing's full potential for scientific research, such as organizing funding, training users, and enforcing data privacy requirements.
03 Nov 2020: Submitted to Computing in Science and Engineering
18 Jan 2021: Published in Computing in Science and Engineering
01 Mar 2021: Published in Computing in Science & Engineering, volume 23, issue 2, pages 26-35. DOI: 10.1109/MCSE.2021.3059437