Chromosome construction and genome assessment
Pseudo-chromosomes were constructed on the basis of the assembled draft
genome and the reads generated from the Hi-C library. After quality
control of Hi-C data, clean reads were mapped to the assembled draft
genome via Bowtie2 (Langdon, 2015) with default parameters. Mapped reads
were further clustered using Juicer (Durand et al., 2016), followed by
ordering and orientation with 3D-DNA (Dudchenko et al., 2017).
Finally, the assembled genome was anchored to 30 chromosomes.
To assess the quality of our final chromosome-level genome assembly,
contig N50 and scaffold N50 values were calculated for comparison with
those of other Siluriformes species. Completeness of the assembled
striped catfish genome was evaluated with Benchmarking Universal
Single-Copy Orthologs (BUSCO; Simão et al., 2015) against the
actinopterygii_odb9 database.
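The N50 statistic used above can be illustrated with a minimal sketch.
It assumes the common definition (the length of the shortest sequence
in the smallest set of longest sequences that together cover at least
half of the total assembly length). The function name and the example
lengths are hypothetical and are not taken from the striped catfish
assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the
    length at which the running total first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp, for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))
```

The same function applies to contigs and to scaffolds; only the input
list of lengths differs.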
Genome annotations and gene predictions
Repeat elements were predicted with de novo and homology-based
methods. For the de novo prediction, RepeatModeler (Smit et al., 2014)
and LTR-FINDER (Xu & Wang, 2007) were each used to generate a repeat
library. The two libraries were combined and then
aligned to the assembled genome with RepeatMasker (Tarailo‐Graovac &
Chen, 2009). Meanwhile, the homology prediction was performed using
RepeatMasker and RepeatProteinMask (Tarailo‐Graovac & Chen, 2009),
based on the known repeat library (Repbase; Jurka et al., 2005). In
addition, tandem repeats were detected with Tandem Repeats Finder
(Benson, 1999). Finally, these predictions were integrated to obtain a
nonredundant set of repeat elements.
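One way to picture the integration step is as interval merging:
repeat calls from different tools that overlap on the same sequence
are collapsed so that no base is counted twice. The sketch below is
only an illustration under that assumption. The function names, the
(sequence, start, end) record layout, and the choice to also merge
directly adjacent intervals are hypothetical and do not describe the
exact procedure used in this study.

```python
from collections import defaultdict


def merge_repeat_intervals(records):
    """Merge overlapping or adjacent repeat intervals per sequence.

    records: iterable of (seq_id, start, end) with 1-based inclusive
    coordinates, pooled from all prediction tools (assumed layout).
    Returns a dict mapping seq_id to a sorted list of (start, end).
    """
    by_seq = defaultdict(list)
    for seq_id, start, end in records:
        by_seq[seq_id].append((start, end))

    merged = {}
    for seq_id, spans in by_seq.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1] + 1:
                # Overlaps or touches the previous interval: extend it.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[seq_id] = [tuple(span) for span in out]
    return merged


def masked_length(merged):
    """Total number of bases covered by the merged intervals."""
    return sum(end - start + 1
               for spans in merged.values()
               for start, end in spans)


# Hypothetical calls from several tools, for illustration only.
calls = [
    ("chr1", 1, 100), ("chr1", 50, 150), ("chr1", 200, 250),
    ("chr2", 10, 20), ("chr1", 151, 160),
]
nonredundant = merge_repeat_intervals(calls)
print(nonredundant["chr1"], masked_length(nonredundant))
```

Dividing the masked length by the assembly size would then give the
repeat content as a fraction of the genome.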
We employed two approaches, homology-based and transcriptome-based, to
predict protein-coding genes. For the homology-based prediction,
protein sequences of 11 representative vertebrate species, namely
zebrafish (Danio rerio), channel catfish (Ictalurus punctatus),
Atlantic cod (Gadus morhua), three-spined stickleback (Gasterosteus
aculeatus), Australian ghost shark (Callorhinchus milii), spotted gar
(Lepisosteus oculatus), Nile tilapia (Oreochromis niloticus), medaka
(Oryzias latipes), Japanese pufferfish (Takifugu rubripes), green
spotted puffer (Tetraodon nigroviridis) and human (Homo sapiens), were
downloaded from Ensembl (Flicek et al., 2013) and mapped to our
assembled genome with TBLASTn (Gertz et al., 2006). Subsequently,
GeneWise (Birney et al., 2004) was used to predict gene structures
from the resulting alignments. For the transcriptome-based prediction,
we applied Cufflinks (Ghosh & Chan, 2016) and Hisat (Kim et al., 2015)
to predict gene structures from the generated transcriptome data.
Finally, these predictions were integrated using MAKER (Campbell et
al., 2014) to obtain a final consistent gene set.
To perform functional annotation, we employed BLASTp (Altschul et al.,
1990) to align the predicted protein sequences against four public
databases: SwissProt (Boeckmann et al., 2003), TrEMBL (Boeckmann et
al., 2003), KEGG (Kanehisa & Goto, 2000) and InterPro (Hunter et al.,
2009). Gene Ontology (GO; Consortium, 2004) terms were then retrieved
from these results.
Gene family analysis
To construct gene families from the protein-coding genes of striped
catfish, we downloaded coding sequences (CDS) of seven representative
vertebrate species, namely human, mouse (Mus musculus), zebrafish,
yellow catfish (Tachysurus fulvidraco), channel catfish, black
bullhead (Ameiurus melas) and giant devil catfish (Bagarius yarrelli),
from GenBank (Benson et al., 2010). After alignment of the predicted
CDS of striped catfish and the other species using BLASTp (Altschul et
al., 1990) (e-value ≤ 1e-5), gene families were clustered with
OrthoMCL (Li et al., 2003). To determine the phylogenetic position of
striped catfish, we employed ClustalW (Thompson et al., 2003) to align
the CDS of single-copy orthologous gene families.
After conserved regions were extracted with Gblocks (Castresana,
2002), the aligned CDS of all single-copy genes were concatenated into
a supergene. Phase 1 sites were extracted from the supergene and used
to construct a phylogenetic tree with PhyML under the maximum
likelihood method (Guindon et al., 2009). Subsequently, the MCMCTREE
program in PAML (Yang, 1997) was employed to estimate divergence
times, with fossil calibrations taken from TIMETREE
(http://www.timetree.org/). The calibration points were 85-97 million
years ago (Mya) for the divergence between human and mouse, and
130-174 Mya for that between zebrafish and striped catfish (P.
hypophthalmus). Moreover, using CAFE (De Bie et al., 2006), we
identified expanded and contracted gene families based on the
clustered gene families and the resulting phylogenetic tree.
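The supergene step in the paragraph above can be sketched as follows.
This is a minimal illustration, not the pipeline actually used. It
assumes that each single-copy family is already a codon-preserving
alignment (length a multiple of three, all species present) and that
"phase 1 sites" means the first position of every codon. The function
names and the toy sequences are hypothetical.

```python
def build_supergene(alignments, species):
    """Concatenate per-family CDS alignments into one supergene.

    alignments: list of dicts mapping species name to its aligned CDS
    for one single-copy gene family (assumed input layout).
    species: species names, all of which must occur in every dict.
    """
    parts = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(aln[sp]) for sp in species}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment differ in length")
        if lengths.pop() % 3 != 0:
            raise ValueError("alignment length is not a multiple of 3")
        for sp in species:
            parts[sp].append(aln[sp])
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


def first_codon_positions(sequence):
    """Keep only the first base of each codon (assumed 'phase 1')."""
    return sequence[0::3]


# Toy alignments for two hypothetical species, for illustration only.
species = ["sp_a", "sp_b"]
families = [
    {"sp_a": "ATGGCC", "sp_b": "ATGGCA"},
    {"sp_a": "TTT---", "sp_b": "TTCAAA"},
]
supergene = build_supergene(families, species)
phase1 = {sp: first_codon_positions(seq) for sp, seq in supergene.items()}
print(supergene)
print(phase1)
```

Because every family alignment is checked to be a multiple of three in
length, the codon frame is preserved across the concatenation and the
stride-three slice stays on first codon positions throughout.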