REEMA JOSHI

and 1 more

Single-cell RNA sequencing (scRNA-seq) has in recent years yielded insights into organism-level behavior through the study of gene expression patterns at the cellular level. Because scRNA-seq data are noisy and biased, normalization is an essential step, yet no gold-standard method for it exists to date. This paper proposes a pipeline for normalization and compression of scRNA-seq data. In scRNA-seq data, the library of a single sample can alone occupy 2 GB to 3 GB of storage, which across thousands of cells amounts to a very large volume; this creates the need for compression. Normalization is often performed as a stand-alone operation, but we propose that compression is an almost equally necessary step before downstream analysis. So far there are very few methods for compressing scRNA-seq data and none, to the best of our knowledge, that integrate normalization and compression. Normalization is needed because technical biases lead to unwanted consequences such as batch effects, high dropout rates, and skewed distributions of gene expression across cells. These issues result in an uneven distribution of zeroes in scRNA-seq data, which the proposed method aims to account for. Our integrated pipeline offers a unified approach that keeps the data pre-processed and summarized before further downstream analyses (differential expression, co-expression) are performed.