2.5 Gene annotation
To predict repetitive regions, RepeatMasker (version 4.1.1) (Tarailo-Graovac & Chen, 2009) was used to screen the S. chinensis genome against the Repbase library (Bao, Kojima, & Kurtz, 2015) with the following command: RepeatMasker -pa 4 -e ncbi -species Hemiptera ch -dir. In addition, an aphid-specific repeat database was generated using RepeatModeler (version 2.0.1, default parameters) to predict transposons and other repetitive regions (Flynn et al., 2020). The statistics from the RepeatMasker and RepeatModeler analyses were then combined.
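The text does not state how the two sets of repeat statistics were combined. One common approach, shown here as a minimal sketch and not as the authors' procedure, is to merge overlapping intervals from both annotation sets so that bases masked by both tools are counted once. The function names and the (sequence, start, end) tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def combined_repeat_bases(*annotation_sets):
    """Count non-redundant repeat-covered bases across annotation sets.

    Each set is an iterable of (seq_id, start, end) tuples with 0-based,
    half-open coordinates (a hypothetical layout, not a tool output format).
    """
    by_seq = defaultdict(list)
    for annotations in annotation_sets:
        for seq_id, start, end in annotations:
            by_seq[seq_id].append((start, end))
    return sum(
        end - start
        for intervals in by_seq.values()
        for start, end in merge_intervals(intervals)
    )


# Toy coordinates, invented purely for illustration.
library_hits = [("scaffold1", 0, 100), ("scaffold1", 150, 200)]
denovo_hits = [("scaffold1", 50, 120), ("scaffold2", 10, 20)]
total_repeat_bases = combined_repeat_bases(library_hits, denovo_hits)
```

Dividing the resulting count by the assembly length would give a combined repeat fraction that does not double-count regions reported by both analyses.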
Gene structures were predicted using the GETA pipeline (version 2.4.2, https://github.com/chenlianfu/geta), which merges the results of RNA-seq-assisted, homology-based, and ab initio methods. Briefly, in the RNA-seq-assisted method, Illumina RNA-seq reads were aligned to the assembled S. chinensis genome using Hisat2 (version 2.1.0.5) (Kim et al., 2015). In the homology-based method, genes were predicted from homologous protein sequences mapped to the genome using GeneWise (version 2.4.1) (Birney, Michele, & Durbin, 2004). Augustus (version 2.5.5) was used for ab initio gene prediction (Stanke et al., 2006; Blanco, Parra & Guigó, 2007). The gene prediction results were then pooled and screened against the PFAM database.
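The pooling and PFAM screening step can be illustrated with a small sketch. The rule used here is an assumption chosen for illustration, not GETA's documented logic: a model is kept if it is supported by RNA-seq or homology evidence, or if it carries at least one PFAM domain hit. The GeneModel class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GeneModel:
    gene_id: str
    # Prediction methods supporting this model, e.g. {"rnaseq", "abinitio"}.
    sources: set = field(default_factory=set)
    # PFAM accessions found in the predicted protein (may be empty).
    pfam_domains: list = field(default_factory=list)


def screen_models(models):
    """Keep models with external evidence or a PFAM domain hit.

    Illustrative filtering rule only (an assumption, not GETA's own code).
    """
    kept = []
    for model in models:
        has_evidence = bool(model.sources & {"rnaseq", "homology"})
        if has_evidence or model.pfam_domains:
            kept.append(model)
    return kept


# Toy gene models, invented for illustration.
pooled = [
    GeneModel("g1", {"rnaseq", "abinitio"}),
    GeneModel("g2", {"abinitio"}, ["PF00069"]),
    GeneModel("g3", {"abinitio"}),
    GeneModel("g4", {"homology"}),
]
kept_ids = [m.gene_id for m in screen_models(pooled)]
```

Under this rule, only ab initio predictions that lack both external evidence and a recognizable protein domain are discarded.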
To assign functions to the newly annotated genes in the S. chinensis genome, these genes were aligned to sequences in the following databases: NCBI Non-Redundant Protein Sequence (Nr), Non-Redundant Nucleotide Sequence Database (Nt), SwissProt, Cluster of Orthologous Groups for eukaryotic complete genomes (KOG), Integrated Resource of Protein Domains and Functional Sites (InterPro), Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes Orthology database (KEGG), and evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG). A local Blast2GO database was also built for GO annotation, and the results were processed with Blast2GO (version 2.5). The KEGG Automatic Annotation Server (KAAS) was used to annotate the S. chinensis genome sequence, with the bi-directional best hit (BBH) method selected.
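The core idea of the bi-directional best hit (BBH) method can be sketched as follows. A query gene and a reference gene form a BBH pair when each is the other's top-scoring hit. This is a simplified illustration: KAAS applies additional score-based criteria that are not modeled here, and the function names and the (query, subject, bitscore) tuple layout are assumptions.

```python
def best_hits(hits):
    """Return {query: subject} keeping the highest-scoring subject per query.

    hits is an iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(forward_hits, reverse_hits):
    """Return {query: subject} pairs that are each other's best hit."""
    forward = best_hits(forward_hits)   # query genes vs. reference genes
    reverse = best_hits(reverse_hits)   # reference genes vs. query genes
    return {q: s for q, s in forward.items() if reverse.get(s) == q}


# Toy alignment scores, invented for illustration.
forward = [("gene1", "refA", 200.0), ("gene1", "refB", 150.0), ("gene2", "refB", 90.0)]
reverse = [("refA", "gene1", 200.0), ("refB", "gene1", 150.0), ("refB", "gene2", 90.0)]
bbh_pairs = bidirectional_best_hits(forward, reverse)
```

In this toy example, gene2's best hit is refB, but refB's best hit is gene1, so only the gene1/refA pair satisfies the reciprocal condition.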