2.4 Genome annotation
Repetitive elements were predicted in the genome of S. tetraptera. We used TRF
(Benson, 1999) and MISA (Thiel, Michalek, Varshney, & Graner, 2003) to
identify tandem repeats and simple sequence repeats (SSRs),
respectively. Transposable elements (TEs) were then identified using both de novo and homology-based strategies. RepeatMasker
(Tarailo-Graovac & Chen, 2009) v4.0.7 was used to run a homology search
for known repeat sequences against the Repbase database v22.11 (Jurka et
al., 2005). RepeatModeler (Jurka et al., 2005) v2.0.10 was used to
predict TEs de novo. Finally, all
identified repetitive elements were merged for subsequent analyses.
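As an illustration of the merging step, the sketch below collapses overlapping repeat intervals reported by different tools into a non-redundant set and counts the masked bases. It is a minimal example, not the procedure used in this study: the function names and the (sequence, start, end) tuple layout with 1-based inclusive coordinates are assumptions made for the example.

```python
from collections import defaultdict


def merge_repeat_intervals(intervals):
    """Merge overlapping or directly adjacent repeat intervals per sequence.

    `intervals` is an iterable of (seqid, start, end) tuples with 1-based,
    inclusive coordinates (an assumed layout, not a tool-specific format).
    Returns a dict mapping seqid to a sorted list of merged (start, end) tuples.
    """
    by_seq = defaultdict(list)
    for seqid, start, end in intervals:
        by_seq[seqid].append((start, end))

    merged = {}
    for seqid, spans in by_seq.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            # Overlapping or adjacent to the last merged span: extend it.
            if start <= out[-1][1] + 1:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[seqid] = [tuple(span) for span in out]
    return merged


def masked_bases(merged):
    """Total number of bases covered by the merged intervals."""
    return sum(end - start + 1 for spans in merged.values() for start, end in spans)
```

Merging before counting avoids counting a base twice when two tools annotate the same region.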
Protein-coding genes were then predicted in the repeat-masked S.
tetraptera genome using an integrated strategy. The RNA-seq reads
derived from the seven tissues (Table S2) were assembled using Trinity
v2.6.6 (Grabherr et al., 2011) in both de novo and
genome-guided modes. For transcriptome-based prediction,
the transcripts from the two assemblies were
combined and aligned to the genome using PASA v2.1.0 to obtain the
gene structures. For homology-based prediction, protein sequences of
seven species (Arabidopsis thaliana (Kaul et al., 2000), Vitis vinifera (Jaillon et al., 2007), Solanum melongena (Barchi et al., 2021), Calotropis gigantea (Hoopes et al.,
2018), Coffea canephora (Denoeud et al., 2014),
Catharanthus roseus (Kellner et al., 2015) and Oryza sativa (J. Yu et al., 2002)) were selected and
aligned against the genome of S. tetraptera using GeMoMa
(Keilwagen, Hartung, & Grau, 2019) v1.6.1.
Augustus (Stanke, Steinkamp, Waack,
& Morgenstern, 2004) v3.3.3, GlimmerHMM (Majoros, Pertea, & Salzberg,
2004), and GENSCAN (Burge & Karlin, 1997) were then employed for ab initio gene prediction. The assembled transcripts of S.
tetraptera were used as the training set for Augustus. Finally, all
predictions produced by the different strategies were integrated into a final
gene set using EVidenceModeler (EVM) v1.1.1 (Haas et al., 2008). BUSCO
was used to assess the completeness of the predicted gene set.
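To make the completeness assessment concrete, the sketch below converts BUSCO category counts into the percentages that are usually reported (complete, single-copy, duplicated, fragmented, missing). The function name and the example counts are hypothetical and are not results from this study.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the total searched.

    Complete (C) is the sum of single-copy (S) and duplicated (D) BUSCOs.
    """
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no BUSCO groups were searched")

    def pct(count):
        return round(100.0 * count / total, 1)

    return {
        "C": pct(single + duplicated),
        "S": pct(single),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
        "n": total,
    }


# Hypothetical counts, for illustration only.
example = busco_summary(single=900, duplicated=50, fragmented=20, missing=30)
```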
For functional annotation, the predicted protein-coding genes were
searched against three public databases – Swiss-Prot, TrEMBL (Boeckmann et al., 2003), and NR
(Coordinators, 2016) – using BLAST (Rédei, 2008).
InterProScan (Quevillon et al., 2005) was then used to predict
protein domains. Gene Ontology (GO) terms
were retrieved with the Blast2GO v2.5 pipeline (Conesa et al., 2005).
Pathway information for each gene was assigned using the KEGG database
(Conesa et al., 2005).
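A common way to turn database search results into one annotation per gene is to keep the best-scoring hit for each query. The sketch below does this for BLAST tabular output with the default 12 columns (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The function name and the e-value cutoff of 1e-5 are assumptions for illustration; the text does not state the thresholds used in this study.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return the highest bit-score hit per query from BLAST tabular lines.

    `lines` is any iterable of tab-separated strings in the default
    12-column layout. `max_evalue` is a hypothetical cutoff.
    Returns a dict: query -> (subject, evalue, bitscore).
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Genes left without a hit below the cutoff are simply absent from the returned dictionary, so they can be reported as unannotated.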