Big Data Analytics to Enable Integrated Research of Biodiversity and Climate Datasets in the Amazon Basin
  • Pedro Luiz Pizzigatti Corrêa,
  • Giri Prakash,
  • Mike Frame,
  • Bhargavi Krishna,
  • Luciana Rizzo,
  • Ricardo Oliveira,
  • Wesley Barbosa,
  • André Batista,
  • Paulo Artaxo,
  • Solange Alves-de-Souza,
  • Katia Ferraz
Pedro Luiz Pizzigatti Corrêa
University of Tennessee

Corresponding Author: pedro.correa@usp.br

Giri Prakash
Oak Ridge National Laboratory
Mike Frame
USGS Headquarters
Bhargavi Krishna
Oak Ridge National Laboratory
Luciana Rizzo
Universidade Federal de São Paulo
Ricardo Oliveira
Universidade de São Paulo
Wesley Barbosa
Universidade de São Paulo
André Batista
Universidade de São Paulo
Paulo Artaxo
USP University of Sao Paulo
Solange Alves-de-Souza
Universidade de São Paulo
Katia Ferraz
Universidade de São Paulo

Abstract

With the widespread adoption of data analysis in scientific fields such as climatology, medicine, astronomy, and astrophysics, the scientific community increasingly recognizes the need for an appropriate analytics infrastructure. Processing the large volumes of data that researchers collect and generate requires suitable tools and applications, and one of the biggest challenges is that these tools must be assembled and adapted for specific domains. Bioclimatic data analysis is one field with much room for improvement in this respect: few efforts have been made to provide methodologies and tools that help researchers understand the complex ways in which environmental variables influence biodiversity. This work therefore proposes a big data analytics architecture, organized as a software ecosystem, that systematizes and simplifies the work of scientists who must deal with the complexity of bioclimatic data analysis. The architecture provides tools for storage, management, analysis with machine learning and data mining algorithms, and visualization. Methodologically, we conducted a thorough literature review to identify the most widely used tools and to assess the suitability of each for this purpose. The literature also pointed to methodologies for implementing software ecosystems, which guided the architecture design. Within the architecture, we sought to assemble a bioclimatic dataset from a subset of climate data obtained from the Atmospheric Radiation Measurement (ARM) data repository and biodiversity data from the Brazilian Biodiversity Portal.
As a result, we assembled a set of tools covering data access (Cassandra), distributed processing (Spark), a programming interface (Jupyter Notebook), system modules for data format conversion, machine learning libraries, and data visualization software. This research discusses the importance of a domain-specific design for a bioclimatic data analysis architecture. We conclude that this type of ecosystem is essential to facilitate the research process and improve the quality of its results.
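
To make the integration step concrete, the following is a minimal sketch of how climate readings and biodiversity occurrence records might be aligned on a shared key before analysis. It is not the authors' implementation: the record layouts, site identifiers, sample values, and the function names `daily_mean_temperature` and `join_occurrences_with_climate` are all assumptions made for illustration, and plain Python stands in for the Cassandra and Spark components named in the abstract.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical climate readings: (ISO date, site id, air temperature in deg C).
climate_rows = [
    ("2014-03-01", "site_a", 26.0),
    ("2014-03-01", "site_a", 28.0),
    ("2014-03-02", "site_a", 30.0),
]

# Hypothetical biodiversity occurrence records: (ISO date, site id, species).
occurrence_rows = [
    ("2014-03-01", "site_a", "Panthera onca"),
    ("2014-03-02", "site_a", "Tapirus terrestris"),
    ("2014-03-03", "site_a", "Panthera onca"),
]


def daily_mean_temperature(rows):
    """Average the temperature readings for each (date, site) key."""
    grouped = defaultdict(list)
    for date, site, temperature in rows:
        grouped[(date, site)].append(temperature)
    return {key: mean(values) for key, values in grouped.items()}


def join_occurrences_with_climate(occurrences, climate_means):
    """Attach the daily mean temperature to each occurrence record.

    Occurrences with no matching climate reading get None, so gaps in
    either dataset stay visible instead of being silently dropped.
    """
    return [
        {
            "date": date,
            "site": site,
            "species": species,
            "mean_temp_c": climate_means.get((date, site)),
        }
        for date, site, species in occurrences
    ]


climate_means = daily_mean_temperature(climate_rows)
joined = join_occurrences_with_climate(occurrence_rows, climate_means)
for record in joined:
    print(record)
```

In a deployment along the lines the abstract describes, the same group-then-join pattern would presumably run as a distributed job over tables read from the storage layer, with the joined records passed on to the machine learning and visualization tools.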