A Review of Data Structures for Data Science
  • Fernando Perez,
  • Jey Kottalam,
  • Kyle Barbary,
  • Awaiting Activation,
  • Kathryn Huff,
  • Daniel Turek,
  • Nathaniel Smith,
  • zhangzhao,
  • Dav Clark,
  • Stéfan van der Walt
Corresponding author: Fernando Perez (fperez@lbl.gov)

Stéfan van der Walt (University of California, Berkeley)

Abstract

Data structures are the foundation upon which computational tools are built. For example, the simple pointer-to-memory approach established by languages such as Fortran and C acts as a de facto standard that lets different packages and libraries interoperate on a single shared array of numerical data in memory. This simple abstraction for n-dimensional arrays has served us well, but there is a clear need for data structures that carry richer semantics and make it easy to express and manipulate common forms of (semi-)structured data. The need is highlighted by the popularity of R’s data frames and of Python libraries such as bcolz (column storage), pandas (indexed data frames), and xray (n-dimensional indexed arrays).

This paper presents the state of the art in the data structures that are foundational to data science, scientific computing, and statistical applications, across programming languages and implementations. It reviews the data representation semantics currently implemented by various libraries, packages, and languages, with an explicit emphasis on interoperability across languages and process boundaries.