2.3 Genome annotation
Structural annotation of the genome combined ab initio prediction, homology-based prediction, and RNA-Seq-assisted prediction. For ab initio gene prediction, Augustus (v3.2.3) (Hoff & Stanke, 2019), GeneID (v1.4) (Parra, Blanco, & Guigo, 2000), Genscan (v1.0) (Aggarwal & Ramaswamy, 2002), GlimmerHMM (v3.04) (Majoros, Pertea, & Salzberg, 2004), and SNAP (2013-11-29) (Korf, 2004) were used in our automated gene prediction pipeline. Six species, Litopenaeus vannamei, Hyalella azteca, Eurytemora affinis, Daphnia pulex, Drosophila hydei, and Bombyx mori, were used for homology-based prediction. Protein sequences from these species were downloaded from Ensembl and NCBI. The protein sequences were aligned to the genome using tBLASTn (v2.2.26; E-value ≤ 1e−5), and each matching protein was then aligned to its homologous genomic region using GeneWise (v2.4.1) (Birney, Clamp, & Durbin, 2004) to obtain accurate spliced alignments. To refine the genome annotation, RNA-Seq reads from different tissues (NCBI BioProject: PRJNA558194) were aligned to the genome. Hierarchical indexing for spliced alignment of transcripts (HISAT; v2.0.4) (Kim, Langmead, & Salzberg, 2015) and TopHat (v2.0.11) (Trapnell, Pachter, & Salzberg, 2009) were used with default parameters to identify exons and splice junctions. The resulting alignments were then used as input for StringTie (v1.3.3) (Pertea et al., 2015) and Cufflinks (v2.2.1) (Trapnell et al., 2010), both run with default parameters, for genome-guided transcriptome assembly.
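The homology step above (filtering tBLASTn hits at E-value ≤ 1e−5 and passing the matched regions on for spliced alignment) can be illustrated with a minimal sketch. It assumes the tBLASTn results are available as 12-column tab-separated lines (query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); the function name `candidate_loci` and the example records are hypothetical and are not taken from the pipeline described here.

```python
def candidate_loci(tabular_lines, max_evalue=1e-5):
    """Collect, per (protein, scaffold, strand), the genomic span covered by
    tBLASTn hits that pass the E-value cutoff.

    Assumption: 12-column tab-separated input with subject start/end in
    columns 9-10 and the E-value in column 11 (1-based).
    """
    spans = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue  # hit is weaker than the cutoff
        # Reverse-strand hits are reported with start > end.
        strand = "+" if sstart <= send else "-"
        lo, hi = min(sstart, send), max(sstart, send)
        key = (query, subject, strand)
        if key in spans:
            old_lo, old_hi = spans[key]
            spans[key] = (min(lo, old_lo), max(hi, old_hi))
        else:
            spans[key] = (lo, hi)
    return spans


# Hypothetical example records (not real data).
example_hits = [
    "protA\tscaf1\t80.0\t100\t20\t0\t1\t100\t1000\t1299\t1e-30\t120",
    "protA\tscaf1\t75.0\t50\t12\t0\t101\t150\t2000\t2149\t1e-10\t60",
    "protA\tscaf2\t40.0\t30\t18\t0\t1\t30\t500\t589\t0.5\t20",
    "protB\tscaf1\t90.0\t60\t6\t0\t1\t60\t900\t721\t2e-8\t70",
]
loci = candidate_loci(example_hits)
```

In this sketch each resulting span would be the region handed to a spliced aligner such as GeneWise, typically with some flanking sequence added; how the regions were actually extracted is not stated in the text.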
Gene functions were assigned from the best match obtained by aligning the predicted protein sequences against the SwissProt database using BLASTp (Altschul et al., 1997) (E-value ≤ 1e−5). Motifs and domains were annotated using InterProScan (v5.31) (Mulder & Apweiler, 2007) by searching against publicly available databases, including ProDom, PRINTS, Pfam, SMART (Simple Modular Architecture Research Tool), PANTHER, and PROSITE. Gene Ontology (GO) IDs for each gene were assigned according to the corresponding InterPro entry. We also mapped the gene set to the KEGG pathway database and identified the best match for each gene.
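The "best match" rule used for functional assignment can be sketched as follows. The text does not define how the best match was chosen; this example assumes it is the hit with the lowest E-value, with ties broken by the higher bit score. The function name `best_hits`, the record layout, and the accessions in the example are all hypothetical.

```python
def best_hits(records, max_evalue=1e-5):
    """Return the best database hit per gene.

    `records` is an iterable of (gene, subject, evalue, bitscore) tuples.
    Assumption: "best" means lowest E-value, then highest bit score.
    """
    best = {}
    for gene, subject, evalue, bitscore in records:
        if evalue > max_evalue:
            continue  # discard hits above the E-value cutoff
        current = best.get(gene)
        # Compare on (evalue, -bitscore) so that smaller tuples are better.
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[gene] = (subject, evalue, bitscore)
    return best


# Hypothetical example records (not real data).
example_records = [
    ("gene1", "sp|P00001", 1e-20, 90.0),
    ("gene1", "sp|P00002", 1e-50, 200.0),
    ("gene1", "sp|P00003", 1e-50, 180.0),
    ("gene2", "sp|P00004", 1e-3, 35.0),
]
assignments = best_hits(example_records)
```

Genes with no hit below the cutoff (here `gene2`) receive no assignment under this rule and would remain without a SwissProt-derived function.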