Chromosome construction and genome assessment
Pseudo-chromosomes were constructed from the assembled draft genome and the reads generated from the Hi-C library. After quality control of the Hi-C data, clean reads were mapped to the draft genome with Bowtie2 (Langdon, 2015) using default parameters. Mapped reads were then clustered using Juicer (Durand et al., 2016), followed by ordering and orientation with 3D-DNA (Dudchenko et al., 2017). Finally, the assembled genome was anchored to 30 chromosomes.
To assess the quality of the final chromosome-level genome assembly, contig N50 and scaffold N50 values were calculated and compared with those of other Siluriformes species. The completeness of the assembled striped catfish genome was evaluated with Benchmarking Universal Single-Copy Orthologs (BUSCO; Simão et al., 2015) using the actinopterygii_odb9 database.
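As an illustration of the N50 statistic mentioned above, the following minimal sketch computes it from a list of sequence lengths. The function name and the example lengths are hypothetical and are not taken from the striped catfish assembly; the same function applies to contigs or scaffolds.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("no sequence lengths supplied")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (illustrative only, not real data).
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))
```

With these example lengths the total is 300, so the running sum first reaches 150 at the second-longest sequence.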
Genome annotations and gene predictions
Repeat elements were predicted using both de novo and homology-based methods. For the de novo prediction, RepeatModeler (Smit et al., 2014) and LTR_FINDER (Xu & Wang, 2007) were primarily used, each generating a repeat library. The two libraries were combined and then aligned to the assembled genome with RepeatMasker (Tarailo‐Graovac & Chen, 2009). Meanwhile, the homology-based prediction was performed using RepeatMasker and RepeatProteinMask (Tarailo‐Graovac & Chen, 2009) with the known repeat library (Repbase; Jurka et al., 2005). In addition, tandem repeats were detected with Tandem Repeats Finder (Benson, 1999). Finally, these predictions were integrated to obtain a nonredundant set of repeat elements.
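The text does not specify how the predictions were integrated. One common way to obtain a nonredundant set, sketched here purely as an assumption, is to merge overlapping or adjacent repeat coordinates from all tools on each sequence. The function names, the 1-based inclusive coordinate convention, and the example records are all hypothetical.

```python
from collections import defaultdict


def merge_repeat_intervals(records):
    """Merge overlapping or adjacent repeat annotations per sequence.

    records: iterable of (seq_id, start, end), 1-based inclusive
    (an assumed convention). Returns a dict mapping each seq_id to a
    sorted list of non-overlapping (start, end) tuples.
    """
    by_seq = defaultdict(list)
    for seq_id, start, end in records:
        by_seq[seq_id].append((start, end))

    merged = {}
    for seq_id, spans in by_seq.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1] + 1:  # overlapping or directly adjacent
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[seq_id] = [tuple(span) for span in out]
    return merged


def masked_bases(merged):
    """Total number of bases covered by the merged intervals."""
    return sum(end - start + 1
               for spans in merged.values()
               for start, end in spans)


# Hypothetical annotations pooled from several tools (illustrative only).
pooled = [("chr1", 1, 100), ("chr1", 50, 150), ("chr1", 151, 200),
          ("chr1", 300, 400), ("chr2", 10, 20)]
nonredundant = merge_repeat_intervals(pooled)
print(nonredundant, masked_bases(nonredundant))
```

Merging by coordinates discards the repeat class of each hit; a real integration step would also need a rule for choosing a class when annotations from different tools overlap.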
We employed two approaches, homology-based and transcriptome-based, to predict protein-coding genes. For the homology-based prediction, protein sequences of 11 representative vertebrate species, including zebrafish (Danio rerio), channel catfish (Ictalurus punctatus), Atlantic cod (Gadus morhua), three-spined stickleback (Gasterosteus aculeatus), Australian ghost shark (Callorhinchus milii), spotted gar (Lepisosteus oculatus), Nile tilapia (Oreochromis niloticus), medaka (Oryzias latipes), Japanese pufferfish (Takifugu rubripes), green spotted puffer (Tetraodon nigroviridis) and human (Homo sapiens), were downloaded from Ensembl (Flicek et al., 2013) and mapped to our assembled genome with TBLASTn (Gertz et al., 2006). Subsequently, GeneWise (Birney et al., 2004) was used to predict gene structures from the resulting alignments. For the transcriptome-based prediction, we applied Cufflinks (Ghosh & Chan, 2016) and HISAT (Kim et al., 2015) to predict gene structures from the generated transcriptome data. Finally, these predictions were integrated using MAKER (Campbell et al., 2014) to obtain a final consistent gene set.
To perform functional annotation, we employed BLASTp (Altschul et al., 1990) to align the predicted protein sequences against four public databases: SwissProt (Boeckmann et al., 2003), TrEMBL (Boeckmann et al., 2003), KEGG (Kanehisa & Goto, 2000) and InterPro (Hunter et al., 2009). Gene Ontology (GO; Consortium, 2004) terms were then retrieved on the basis of these results.
Gene family analysis
To construct gene families from the protein-coding genes of striped catfish, we downloaded coding sequences (CDS) of seven representative vertebrate species, including human, mouse (Mus musculus), zebrafish, yellow catfish (Tachysurus fulvidraco), channel catfish, black bullhead (Ameiurus melas) and giant devil catfish (Bagarius yarrelli), from GenBank (Benson et al., 2010). After alignment of the predicted CDS of striped catfish against those of the other species using BLASTp (Altschul et al., 1990) (e-value ≤ 1e-5), gene families were clustered with OrthoMCL (Li et al., 2003). To reveal the phylogenetic position of striped catfish, we employed ClustalW (Thompson et al., 2003) to align the CDS of single-copy orthologous gene families.
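The two filtering steps described above can be sketched as follows: keeping alignments at or below the e-value cutoff, and selecting families in which every species contributes exactly one gene. This is a minimal sketch and not the authors' scripts. It assumes BLAST tabular output with the standard 12 columns (e-value in column 11) and a hypothetical in-memory representation of the clustered families; all function names and example identifiers are illustrative.

```python
from collections import Counter

EVALUE_CUTOFF = 1e-5  # threshold stated in the text


def filter_hits(lines, cutoff=EVALUE_CUTOFF):
    """Keep (query, subject, evalue) for hits with evalue <= cutoff.

    Assumes tab-separated BLAST output with the standard 12 columns,
    where the e-value is the 11th field.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= cutoff:
            kept.append((fields[0], fields[1], evalue))
    return kept


def single_copy_families(families, species):
    """Return IDs of families with exactly one gene from every species.

    families: dict mapping family_id -> list of (species, gene_id).
    """
    wanted = set(species)
    selected = []
    for family_id, members in families.items():
        counts = Counter(sp for sp, _ in members)
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            selected.append(family_id)
    return selected


# Hypothetical hits and families (illustrative only).
hits = [
    "g1\tg2\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t200",
    "g1\tg3\t30.0\t50\t30\t2\t1\t50\t1\t50\t0.002\t35",
]
families = {
    "F1": [("human", "h1"), ("mouse", "m1")],
    "F2": [("human", "h2"), ("human", "h3"), ("mouse", "m2")],
    "F3": [("human", "h4")],
}
print(filter_hits(hits))
print(single_copy_families(families, ["human", "mouse"]))
```

Here only the first hit passes the cutoff, and only family F1 has exactly one gene from each of the two example species.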
After the conserved regions were obtained via Gblocks (Castresana, 2002), the aligned CDS of all single-copy genes were concatenated into a supergene. With phase 1 sites extracted from the supergene, a phylogenetic tree was constructed using PhyML with the maximum likelihood method (Guindon et al., 2009). Subsequently, the MCMCTree program in PAML (Yang, 1997) was employed to estimate the divergence times, with the assistance of fossil calibrations from TimeTree (http://www.timetree.org/). The calibrations included 85-97 million years ago (Mya) for the divergence between human and mouse, as well as 130-174 Mya for that between zebrafish and striped catfish (P. hypophthalmus). Moreover, using CAFE (De Bie et al., 2006), we identified expanded and contracted gene families based on the clustered gene families and the resulting phylogenetic tree.
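The supergene construction and site extraction described above can be sketched as below. Two assumptions are made here that the text does not state: that "phase 1 sites" refers to the first position of each codon, and that every per-gene alignment is in frame with a length that is a multiple of three. The function names and the toy alignments are hypothetical.

```python
def concatenate_alignments(alignments, species):
    """Join per-gene codon alignments into one supergene per species.

    alignments: list of dicts mapping species -> aligned CDS string.
    All sequences within one alignment must have equal length, and
    that length must be a multiple of three (assumed in-frame).
    """
    parts = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(aln[sp]) for sp in species}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment differ in length")
        if lengths.pop() % 3 != 0:
            raise ValueError("alignment length is not a multiple of three")
        for sp in species:
            parts[sp].append(aln[sp])
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


def codon_position_sites(sequence, position=1):
    """Extract every site at the given codon position (1, 2 or 3)."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2 or 3")
    return sequence[position - 1::3]


# Toy alignments for two hypothetical species (illustrative only).
genes = [
    {"sp_a": "ATGGCC", "sp_b": "ATGGCA"},
    {"sp_a": "TTT---", "sp_b": "TTCAAA"},
]
supergene = concatenate_alignments(genes, ["sp_a", "sp_b"])
first_sites = {sp: codon_position_sites(seq) for sp, seq in supergene.items()}
print(supergene)
print(first_sites)
```

The resulting per-species strings of first-position sites would then be written out in whatever alignment format the tree-building program expects.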