2.3 Bioinformatics processing to identify OTUs at different thresholds of genetic similarity
The resulting paired-end reads of the 42 samples were quality filtered following procedures described by Arribas et al. (2020). Briefly the processing included quality checking, primer removal, pair merging, quality filtering, denoising, and clustering each library independently. We checked raw reads quality with fastqc(https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). We trimmed primers using fastx_trimmer option of thefastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) to trim the primer sequences (20 and 26 bases for R1 and R2, respectively). Then, we processed reads in trimmomatic-0.36(Bolger et al., 2014) using TRAILING:20 to cut-off bases of the end of a read, if below a threshold quality of 20. We used R1 and R2 reads to search paired sequences with pairfq-0.17(Staton, 2013) and we merged paired reads overlapping with fastq_mergepairs command,-fastq_minovlen 50 and -fastq_maxdiffs 15 inusearch-9.2(Edgar, 2013). We used quality-filtered (Maxee = 1), dereplicated (-fastx_uniques ) and sorted (-sortbylength ) options to keep only reads of 418 pb in usearch-10(Edgar & Flyvbjerg, 2015). Surviving reads were denoised to generate zero-radius OTUs (ZOTUs) with the unoise3 and -minsize 4 commands. ZOTUs are equivalent to Amplicon Sequence Variants (ASVs; sensu Callahan et al., 2016), and are proposed to be a set of predicted biological sequences to be used for direct analysis without the need of OTU clustering (Callahan et al., 2017; Edgar, 2016).
We then assigned high-level taxonomic categories in all of the reads using the lowest common ancestor (LCA) algorithm of MEGAN-6(Huson et al., 2016). Taxonomic identification of each read was done using BLAST against thenucleotide NCBI nt database (June 06 2018; blastn -outfmt 5 -evalue 0.001). We fed BLAST matches into MEGAN(Huson et al., 2016), and the taxonomic assignments were used to extract eight ASV datasets, each one of the following orders: Diptera, Collembola, Arachnida, Coleoptera, Hymenoptera, Hemiptera, Myriapoda, Lepidoptera. Remaining sequences were no further considered. We exported the tree (in newick format), visualised and edited it using figtree-1.4.3 (Rambaut, 2012). Each ASV dataset was aligned in geneious-8.0.2 with MAFFT, using the FFT-NS-1 algorithm, a scoring matrix of 200/PAM/K=2, GAP open penalty of 3, and the Translation align option. We reviewed all sequences for insertions, deletions or stop codons disrupting the reading frame, which were afterwards excluded.
Subsequently, we generated a community table with read-counts (haplotype abundance) of each retained ASV for the eight orders by matching ASVs against the complete collection of reads (i.e., reads before the dereplicating and denoising steps) using -search_exact command with usearch -10(Edgar & Flyvbjerg, 2015). Additional filtering according to AVS abundances in community tables (one ASV per taxa) was performed as described by Arribas et al., (2020). Shortly, we first removed from each library those haplotypes with abundances of four or fewer reads (same criteria than denoising). Second, we identified haplotypes contributing with less than 1% of the total reads of the library where they were present, and removed them from the analysis. The 1% cut-off value has been seen in similar datasets to remove most of the spurious ASVswhile maximizing the number of real haplotypes (Andújar et al., 2020; Arribas et al., 2020). Lastly, the community tables of filtered haplotypes were transformed into incidence (presence/absence) data and used for downstream analyses.
Finally, we defined lineages at different clustering levels for each of the orders. For this, each of the ASV filtered dataset was used to generate an UPGMA tree with corrected distances under a F84 model, and based on this tree, all haplotypes were nested into clusterin levels (CLs) following the genetic similarity at different thresholds (0.5%, 1.5%, 3%, 5% and 7.5%), plus an additional threshold corresponding to the result of a species delimitation analyses conducted with the generalized mixed Yule-coalescent (GMYC) model in R(Pons et al., 2006). The analyses were performed using vegan (Oksanen et al., 2013),cluster , PMCMR , hier.part , ecodist , andbetapart(Baselga & Orme, 2012).