Abstract
Scientific data has traditionally been distributed via downloads from a data server to a local computer. This approach becomes increasingly limiting as scientific datasets grow towards the petabyte scale. A "cloud-native data repository," as defined in this paper, offers several advantages over traditional data repositories: performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access and inclusion. These objectives motivate a set of best practices for cloud-native data repositories: analysis-ready, cloud-optimized (ARCO) data formats and loose coupling with data-proximate computing. The Pangeo Project has developed a prototype implementation of these principles using open-source scientific Python tools. By providing an ARCO data catalog together with on-demand, scalable distributed computing, Pangeo enables users to process big data at rates exceeding 10 GB/s. Several challenges must be resolved to realize cloud computing's full potential for scientific research, such as organizing funding, training users, and enforcing data privacy requirements.