Abstract
1. Accurate differential expression analysis of microbial
metatranscriptomes based on next-generation sequencing depends partly on
the depth of the libraries used. Estimating the sequencing depth
required to sample the metatranscriptome of interest effectively with
RNA-seq is therefore an essential first step, both to obtain robust
results in downstream analysis and to avoid overspending once the
information contained in the library reaches saturation. 2. Here we
present a method to calculate the sequencing effort from saturation
curves and to make a priori predictions of gene numbers, using a
simulated series of metatranscriptomic/metagenomic matrices. The method
extrapolates the rarefaction curve with a Weibull growth model, fitted
using a machine learning approach, to estimate the maximum number of
genes/OTUs as a function of sequencing depth. This approach allows us to
compute the effort at different confidence intervals and to obtain an
approximate a priori effort based on an initial fraction of sequences.
3. The
accuracy of the results obtained with simulations and real samples (15
metatranscriptome datasets from the oral cavity; RNA sequences consist
of vectors with depths of 10^5 to 1.5 x 10^7 reads and sizes of 10000
and 600000 genes) allows one to use an initial shallowly sequenced
sample (in this case 20% of the total number of reads sampled; accuracy
R^2 > 0.99 for simulated samples and 60-93% for real samples) to
estimate the expected sequencing effort needed to cover the whole
metatranscriptome/metagenome from the same sample, and thus to estimate
the sample size. The algorithm implementing the proposed method was
saved as a function for R. 4. This proposed method
of estimating the maximum number of genes/OTUs, and the number of reads
needed to reach 90, 95 and 99% of that maximum, efficiently helps
researchers determine whether the sampling is sufficient or needs to be
increased.
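
As an illustration of how a fitted Weibull growth curve could be turned
into an effort estimate, the following minimal Python sketch inverts the
curve to find the depth at which a given fraction of the asymptote is
reached. It is not the R function described above. The parameterization
y = a - b*exp(-c*x^d), the function names and all parameter values are
assumptions made for this example; in practice the parameters would come
from fitting the model to an observed rarefaction curve.

```python
import math


def weibull_growth(reads, a, b, c, d):
    """Genes/OTUs expected at a given depth under the assumed
    parameterization y = a - b * exp(-c * reads**d), where a is the
    asymptote (maximum number of genes/OTUs)."""
    return a - b * math.exp(-c * reads ** d)


def reads_for_fraction(p, a, b, c, d):
    """Depth at which the curve reaches fraction p of the asymptote a.

    Solving a - b*exp(-c*x**d) = p*a for x gives
    x = (-ln(a*(1 - p)/b) / c) ** (1/d).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    ratio = a * (1.0 - p) / b
    if ratio >= 1.0:
        # The curve already meets the target at zero reads.
        return 0.0
    return (-math.log(ratio) / c) ** (1.0 / d)


# Hypothetical fitted parameters, chosen only for illustration.
a, b, c, d = 50000.0, 50000.0, 1e-4, 0.7

for p in (0.90, 0.95, 0.99):
    depth = reads_for_fraction(p, a, b, c, d)
    print(f"{p:.0%} of maximum: about {depth:,.0f} reads")
```

Comparing these depths with the depth of the library already sequenced
indicates whether further sequencing is likely to recover many
additional genes/OTUs.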