scFlow: A Scalable and Reproducible Analysis Pipeline for Single-Cell RNA Sequencing Data
  • Combiz Khozoie,
  • Nurun Fancy,
  • Mahdi M. Marjaneh,
  • Alan E. Murphy,
  • Paul M. Matthews,
  • Nathan Skene
Combiz Khozoie
UK Dementia Research Institute, Imperial College London, Department of Brain Sciences, Imperial College London, United Kingdom

Corresponding Author: c.khozoie@imperial.ac.uk

Nurun Fancy
UK Dementia Research Institute, Imperial College London, Department of Brain Sciences, Imperial College London, United Kingdom
Mahdi M. Marjaneh
UK Dementia Research Institute, Imperial College London, Department of Brain Sciences, Imperial College London, United Kingdom
Alan E. Murphy
UK Dementia Research Institute, Imperial College London, Department of Brain Sciences, Imperial College London, United Kingdom
Paul M. Matthews
UK Dementia Research Institute, Imperial College London, Department of Brain Sciences, Imperial College London, United Kingdom
Nathan Skene
UK Dementia Research Institute, Imperial College London, Department of Brain Sciences, Imperial College London, United Kingdom

Abstract

Advances in single-cell RNA-sequencing technology over the last decade have enabled exponential increases in throughput: datasets with over a million cells are becoming commonplace. The burgeoning scale of data generation, combined with the proliferation of alternative analysis methods, led us to develop the scFlow toolkit and the nf-core/scflow pipeline for reproducible, efficient, and scalable analyses of single-cell and single-nuclei RNA-sequencing data. The scFlow toolkit provides a higher level of abstraction on top of popular single-cell packages within the R ecosystem, while the nf-core/scflow Nextflow pipeline is built within the nf-core framework to enable compute infrastructure-independent deployment across all institutions and research facilities. Here we present our flexible pipeline, which leverages the advantages of containerization and the potential of cloud computing to make the analysis of large case/control datasets easy to orchestrate and scale, even for non-expert users. We demonstrate the functionality of the analysis pipeline, from sparse-matrix quality control through to insight discovery, with example analyses of four recently published public datasets, and we describe the extensibility of scFlow as a modular, open-source tool for single-cell and single-nuclei bioinformatic analyses.