2.4 Genome annotation

Repetitive elements were predicted in the genome of S. tetraptera. We used TRF (Benson, 1999) and MISA (Thiel, Michalek, Varshney, & Graner, 2003) to identify tandem repeats and simple sequence repeats (SSRs), respectively. Transposable elements (TEs) were then identified using both de novo and homology-based strategies. RepeatMasker v4.0.7 (Tarailo-Graovac & Chen, 2009) was used to search for known repeat sequences against the Repbase database v22.11 (Jurka et al., 2005). RepeatModeler v2.0.10 (Jurka et al., 2005) was employed to predict TEs de novo. Finally, all identified repetitive elements were merged for subsequent analyses.
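
The merging step is not described in further detail. As a minimal sketch of one way such a merge could be done, the Python snippet below collapses overlapping or adjacent repeat intervals from several tools into a non-redundant set and counts the masked bases. The function names, the toy coordinates, and the choice to merge adjacent intervals are illustrative assumptions, not the procedure used in this study.

```python
from collections import defaultdict


def merge_repeat_intervals(intervals):
    """Merge overlapping or adjacent (seqid, start, end) intervals per sequence.

    Coordinates are assumed to be 1-based and inclusive, as in GFF.
    """
    by_seq = defaultdict(list)
    for seqid, start, end in intervals:
        by_seq[seqid].append((start, end))

    merged = {}
    for seqid, spans in by_seq.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            # Extend the current block if this span overlaps or directly abuts it.
            if start <= out[-1][1] + 1:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[seqid] = [tuple(span) for span in out]
    return merged


def masked_bases(merged):
    """Total number of bases covered by the merged intervals."""
    return sum(end - start + 1 for spans in merged.values() for start, end in spans)


# Toy coordinates standing in for the output of each tool (hypothetical values).
trf = [("chr1", 100, 200), ("chr1", 500, 560)]
misa = [("chr1", 150, 220)]
repeatmasker = [("chr1", 221, 300), ("chr2", 10, 50)]

merged = merge_repeat_intervals(trf + misa + repeatmasker)
total = masked_bases(merged)
```

Merging before counting avoids double-counting bases that more than one tool annotates as repetitive.
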
Protein-coding genes were then predicted in the repeat-masked S. tetraptera genome using an integrated strategy. The RNA-seq reads derived from the seven tissues (Table S2) were assembled using Trinity v2.6.6 (Grabherr et al., 2011) in both de novo and genome-guided modes. For transcriptome-based prediction, the transcripts assembled in the two modes were combined and aligned to the genome with PASA v2.1.0 to obtain gene structures. For homology-based prediction, protein sequences of seven species (Arabidopsis thaliana (Kaul et al., 2000), Vitis vinifera (Jaillon et al., 2007), Solanum melongena (Barchi et al., 2021), Calotropis gigantea (Hoopes et al., 2018), Coffea canephora (Denoeud et al., 2014), Catharanthus roseus (Kellner et al., 2015) and Oryza sativa (J. Yu et al., 2002)) were aligned against the genome of S. tetraptera using GeMoMa v1.6.1 (Keilwagen, Hartung, & Grau, 2019). Augustus v3.3.3 (Stanke, Steinkamp, Waack, & Morgenstern, 2004), GlimmerHMM (Majoros, Pertea, & Salzberg, 2004), and GENSCAN (Burge & Karlin, 1997) were then employed for ab initio gene prediction. The assembled transcripts of S. tetraptera were used as the training set for Augustus. Finally, the predictions produced by the different strategies were integrated into a final gene set using EVidenceModeler (EVM) v1.1.1 (Haas et al., 2008). BUSCO was used to assess the completeness of the gene prediction.
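
EVM combines evidence according to a tab-delimited weights file with three columns: evidence class, source label, and weight. The weights applied here are not reported, so the Python sketch below only illustrates how such a file might be assembled. All weight values and source labels in it are placeholders (the labels must match the source column of the GFF3 files actually supplied to EVM), and build_evm_weights is a hypothetical helper, not part of EVM.

```python
# Evidence classes accepted in the first column of an EVM weights file.
EVM_CLASSES = {"ABINITIO_PREDICTION", "PROTEIN", "TRANSCRIPT", "OTHER_PREDICTION"}


def build_evm_weights(entries):
    """Return the text of a weights file from (class, source, weight) tuples."""
    lines = []
    for evidence_class, source, weight in entries:
        if evidence_class not in EVM_CLASSES:
            raise ValueError(f"unknown evidence class: {evidence_class}")
        if weight < 0:
            raise ValueError(f"weight must be non-negative: {weight}")
        lines.append(f"{evidence_class}\t{source}\t{weight}")
    return "\n".join(lines) + "\n"


# Placeholder weights and labels: assumptions for illustration only,
# not the values used in this study.
entries = [
    ("ABINITIO_PREDICTION", "Augustus", 1),
    ("ABINITIO_PREDICTION", "GlimmerHMM", 1),
    ("ABINITIO_PREDICTION", "GENSCAN", 1),
    ("PROTEIN", "GeMoMa", 5),
    ("TRANSCRIPT", "PASA", 10),
]

weights_text = build_evm_weights(entries)
```

A common design choice is to weight transcript and homology evidence above ab initio predictions, since the former are supported by experimental data or known proteins; the appropriate values depend on the data set.
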
For functional annotation, the predicted protein-coding genes were searched against three public databases, Swiss-Prot, TrEMBL (Boeckmann et al., 2003), and NR (Coordinators, 2016), using BLAST (Rédei, 2008). We then used InterProScan (Quevillon et al., 2005) to predict protein domains. Gene Ontology (GO) terms were retrieved using the Blast2GO v2.5 pipeline (Conesa et al., 2005). Pathway information for each gene was assigned using the KEGG database (Conesa et al., 2005).
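
Annotation from BLAST searches typically keeps one best hit per gene. The Python sketch below shows one way to do this, assuming the results are in the standard 12-column tabular format (BLAST+ -outfmt 6), in which the E-value and bit score are the 11th and 12th columns. The E-value cutoff of 1e-5, the function name, and the toy records are assumptions for illustration; the thresholds used in this study are not stated.

```python
def best_blast_hits(lines, max_evalue=1e-5):
    """Keep the highest-scoring hit per query from tabular BLAST output.

    Returns a dict mapping query ID to (subject ID, E-value, bit score).
    The default max_evalue is an assumed cutoff, not one taken from the study.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy records with hypothetical gene and subject identifiers.
toy = [
    "gene1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "gene1\tsubjB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250",
    "gene2\tsubjC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
]

hits = best_blast_hits(toy)
```

In this toy input, gene1 is assigned subjB (the higher bit score), and gene2 receives no annotation because its only hit fails the assumed E-value cutoff.
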