2.6 Similarity in gene expression among samples
To assess the magnitude and direction of variation in gene expression
among samples, we calculated the pairwise correlation of gene
expression levels and the Euclidean distances among samples in DESeq2
(version 1.22.2; Love et al., 2014) following the
program directions. These measures are especially useful to assess the
similarity of biological replicates (e.g., samples belonging to the same
group) (Koch et al. 2018) and therefore to detect anomalies among the
samples. The sample correlation matrix was calculated as the Pearson
correlation of the normalized count matrix obtained by applying the
variance stabilizing transformation (vst) to the 2000 most variable
genes in the HTSeq count data. The vst accounts for the high sampling
variability of low counts. Sample Pearson
correlation is calculated pairwise among samples and ranges from -1 to
1, where a value of 0 indicates no linear correlation between the
expression profiles of the two samples, a value of 1 indicates a perfect
positive linear relationship, and a value of -1 indicates a perfect
negative linear relationship. The Euclidean distance
between samples was calculated as $\mathrm{dist} = \sqrt{1 -
\mathrm{cor}^2}$, where cor is the Pearson correlation coefficient
between two samples. The smaller the distance, the stronger the
correlation between the samples. These distances were then used to build
the sample-distance heatmaps for each normalized matrix; the
normalization shrinks the data towards each gene's average expression
across all samples. Gene heatmaps were instead based on raw counts
normalized with the vst, after which the mean expression in each sample
was centered at 0. Finally, differences in gene expression among
the studied groups (see below) were visualized in a principal component
analysis (PCA) plot of the gene count matrix after applying the variance
stabilizing transformation (vst) to normalize the raw counts. PCA plots
are useful to assess the effect of covariates and batch effects
(non-biological variation due to experimental artifacts) (Reese et al.
2013).
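
As a minimal illustration of the correlation-to-distance step described
above, the following Python sketch computes pairwise Pearson
correlations between samples and converts each coefficient to a distance
using dist = sqrt(1 - cor^2). It is not the DESeq2/R workflow used in
this study: the sample names and expression values are hypothetical
stand-ins for vst-transformed data, and the helper names (pearson,
correlation_distance, sample_matrices) are introduced here for
illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(cor):
    """Distance from a correlation coefficient: sqrt(1 - cor^2)."""
    # Clamp at 0 to guard against tiny negative values caused by
    # floating-point rounding when |cor| is very close to 1.
    return math.sqrt(max(0.0, 1.0 - cor * cor))


def sample_matrices(samples):
    """Return (correlation, distance) matrices as nested dicts.

    `samples` maps a sample name to its list of expression values,
    with genes in the same order for every sample.
    """
    names = list(samples)
    cor = {a: {b: pearson(samples[a], samples[b]) for b in names}
           for a in names}
    dist = {a: {b: correlation_distance(cor[a][b]) for b in names}
            for a in names}
    return cor, dist


# Hypothetical vst-like expression values for four genes (assumption:
# toy numbers chosen for illustration, not data from this study).
samples = {
    "ctrl_1": [5.0, 7.0, 9.0, 6.0],
    "ctrl_2": [5.5, 7.5, 9.5, 6.5],
    "treat_1": [9.0, 6.0, 5.0, 8.0],
}

cor, dist = sample_matrices(samples)

for a in samples:
    for b in samples:
        print(f"{a} vs {b}: cor={cor[a][b]:.3f} dist={dist[a][b]:.3f}")
```

Because the formula squares the coefficient, a strongly negative
correlation yields a distance as small as an equally strong positive
one. In the toy data, treat_1 is negatively correlated with ctrl_1 yet
lies close to it under this distance, which is worth keeping in mind
when reading a sample-distance heatmap built this way.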