2.4.2. Capture
Raw reads were trimmed for Illumina adapters using Trimmomatic (v 0.38;
Bolger, Lohse, & Usadel, 2014) and then quality-filtered with the
PRINSEQ-lite Perl script (min_qual_mean=25, trim_qual_window=3,
trim_qual_step=1, min_len=60; Schmieder & Edwards, 2011). Trimmed
reads corresponding to rDNA were extracted using SortMeRNA (v2.1;
Kopylova, Noé, & Touzet, 2012) with default parameters.
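The read-level thresholds above can be illustrated with a minimal Python sketch. It is not the PRINSEQ-lite implementation: it assumes Phred+33 encoded quality strings, covers only the min_qual_mean and min_len criteria, and omits the sliding-window end trimming (trim_qual_window, trim_qual_step). The function names are hypothetical.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_filter(quals, min_qual_mean=25, min_len=60):
    """Return True if a read meets both thresholds.

    quals is a list of per-base Phred scores. A read is kept when it has
    at least min_len bases and its mean quality is at least min_qual_mean.
    """
    if len(quals) < min_len:
        return False
    return sum(quals) / len(quals) >= min_qual_mean


# Example: a 60-base read with uniform quality 40 ('I' in Phred+33) is kept.
keep = passes_filter(phred33("I" * 60))
```

Reads that fail either criterion would be discarded before the rDNA extraction step.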
Near-full-length 16S and 18S rDNA sequences were reconstructed using
EMIRGE software (v 0.60; Miller, Baker, Thomas, Singer, & Banfield,
2011) and the emirge_amplicon.py script. This tool performs
reference-based assembly of reads while still permitting the reconstruction of
distant variants. The database used was SILVA 132 SSURef NR99, including
fragments with lengths from 1200 to 2000 bp. EMIRGE was run with
join_threshold set to 1 and for 120 iterations. Only sequences longer
than 800 bp were kept. Taxonomic affiliation was performed with the
classify-sklearn method of the QIIME2 feature-classifier plugin (v. 2019.1;
Bokulich et al., 2018; Bolyen et al., 2019) and the full-length SILVA
132 database, with the confidence threshold (p-confidence) set to 0.7.
This type of analysis is hereafter referred to as CBH-long.
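The two length criteria in this paragraph (reference fragments of 1200 to 2000 bp, reconstructed sequences longer than 800 bp) can be sketched in Python as follows. This is an illustration only and not part of EMIRGE or QIIME2. The helper names are hypothetical, and treating the 1200 to 2000 bp bounds as inclusive is an assumption.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def keep_reference(seq, lo=1200, hi=2000):
    """Reference database criterion (bounds assumed inclusive)."""
    return lo <= len(seq) <= hi


def keep_reconstructed(seq, min_len=800):
    """Reconstructed SSU criterion: strictly longer than min_len."""
    return len(seq) > min_len


# Example with an in-memory FASTA; real input would be read from a file.
records = list(read_fasta([">seq1", "ACGT", "AC", ">seq2", "G" * 801]))
kept = [name for name, seq in records if keep_reconstructed(seq)]
```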
Additionally, Kraken2-based analysis (Wood, Lu, & Langmead, 2019; Wood
& Salzberg, 2014) was performed starting from paired reads to evaluate
all captured diversity (without gene reconstruction), because
insufficient coverage of some taxa could prevent the reconstruction of
longer sequences and thus cause these taxa to be absent from the final
dataset. The database used was the prepackaged SILVA database provided
by Kraken2. We tested confidence scores from 0.0 to 1.0 in steps of
0.1. For the final analyses, a score of 0.7 was retained to ensure
good specificity of taxonomic affiliation. This is in line with a
previous report that values from 0.6 to 0.7 gave the best
results for sensitivity and precision (Wood & Salzberg, 2014). Data
from these analyses are hereafter referred to as CBH-short.
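The confidence sweep described above could be scripted along the following lines. This sketch only builds the Kraken2 command lines and does not run them. The database path, read file names, and output prefix are hypothetical placeholders, and the exact invocation used in the study is not stated in the text.

```python
def confidence_grid(start=0.0, stop=1.0, step=0.1):
    """Return confidence values from start to stop inclusive, rounded to 0.1."""
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 1) for i in range(n_steps + 1)]


def kraken2_command(db, reads_1, reads_2, confidence, prefix):
    """Build one paired-end Kraken2 call for a given confidence score."""
    tag = f"{confidence:.1f}"
    return [
        "kraken2",
        "--db", db,
        "--paired",
        "--confidence", tag,
        "--output", f"{prefix}_conf{tag}.kraken",
        "--report", f"{prefix}_conf{tag}.report",
        reads_1, reads_2,
    ]


# Hypothetical paths, for illustration only.
commands = [
    kraken2_command("silva_db", "sample_R1.fastq", "sample_R2.fastq", c, "sample")
    for c in confidence_grid()
]
```

Each command list could then be passed to subprocess.run, and the resulting reports compared across confidence values before settling on a final threshold.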