Genome assembly and annotation
The original draft genomes used paired-end and mate-pair Illumina
library sequencing (Keeling et al., 2013c). We made substantial
improvements to these assemblies with proximity ligation-based
scaffolding with HiRise; linkage-map/ALLMAPS-informed corrections and
scaffolding; further improvements with LINKS, RAILS, and ABySS-Sealer;
and PhylOligo-based removal of contaminant scaffolds (Fig. 1, Tab. 1).
All of these tools were developed after the original draft assemblies
were prepared. A comparison of scaffold sizes between draft and final
genome assemblies is shown in Supp. Fig. 2. The final female and male
genome assembly sizes were 223.7 and 224.8 Mb, with N50s/L50s of 16.6
Mb/4 and 16.4 Mb/4, respectively. Gregory et al. (2013) used flow
cytometry to estimate a 208 Mb genome size. The non-N portions of the
genome assemblies were close to this value: 214.0 Mb for the
female assembly and 210.5 Mb for the male assembly. Compared to the
draft assemblies, N50 values increased by 26- and 36-fold, and the
number of scaffolds decreased by 67 and 75 percent, respectively. Ninety
percent of each assembly was contained in the largest 12 (female) and 11
(male) scaffolds. Based on linkage mapping information, these 12 largest
scaffolds in the female assembly represent the karyotype of this species
(11 AA + neo-XX). The male assembly did not contain a large scaffold
representing the neo-Y chromosome.
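For readers less familiar with the contiguity statistics reported above, the following is a minimal sketch of how Nx/Lx values (N50/L50 at a fraction of 0.5, or the number of scaffolds holding 90% of an assembly at 0.9) and the non-N assembly length can be computed. The function names and the toy scaffold lengths are illustrative assumptions and are not taken from the pipeline used in this study.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a collection of scaffold lengths.

    Nx is the length of the shortest scaffold in the smallest set of
    largest scaffolds that together cover `fraction` of the assembly;
    Lx is the number of scaffolds in that set.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no scaffold lengths given")
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    # Only reachable if fraction > 1; report the whole assembly.
    return ordered[-1], len(ordered)


def non_n_length(sequence):
    """Count the bases in a sequence that are not N (gap) characters."""
    return sum(1 for base in sequence if base not in "Nn")


# Hypothetical toy assembly (arbitrary units), for illustration only.
toy_lengths = [10, 8, 6, 4, 2]
n50, l50 = nx_lx(toy_lengths, 0.5)
n90, l90 = nx_lx(toy_lengths, 0.9)
```

With real data, the lengths would be read from the scaffold FASTA file, and the non-N length would be summed over all scaffolds.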
Each step in the assembly process contributed to the improved
assemblies, and incremental assembly statistics at each step are shown
in Supp. Tab. 1. Chicago HiRise scaffolding dramatically increased
contiguity, reducing the number of scaffolds by 56-66%. Hi-C HiRise
scaffolding reduced the number of scaffolds by an additional 21%. The
linkage map information allowed us to correct misjoins in the HiRise
assemblies and join additional scaffolds. Visualization of the linkage
map information with ALLMAPS revealed several instances in which
scaffolds from the Chicago HiRise step had been flipped and/or placed
out of order relative to adjacent scaffolds during the Hi-C HiRise step,
as judged against the linkage map information and the assembly from the
other sex, even though both assemblies were based upon the same
scaffolding information. An example is shown in Fig.
2. In total, nine of the twelve largest scaffolds were modified (Supp.
Fig. 3).
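The idea of detecting a flipped scaffold from linkage map information can be illustrated with a simple sketch. This is not the algorithm implemented in ALLMAPS; it is a simplified pairwise-concordance check, under the assumption that each marker has a physical position on the scaffold and a genetic position on the linkage group. The function name and the marker values are hypothetical.

```python
from itertools import combinations


def marker_orientation(markers):
    """Infer scaffold orientation relative to a linkage group.

    `markers` is a list of (scaffold_position_bp, map_position_cM)
    tuples. Returns "+" if physical and genetic order mostly agree,
    "-" if they mostly disagree (scaffold likely flipped), and "?"
    if the markers are uninformative.
    """
    concordant = 0
    discordant = 0
    for (pos1, cm1), (pos2, cm2) in combinations(markers, 2):
        direction = (pos2 - pos1) * (cm2 - cm1)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # Pairs tied in either coordinate carry no orientation signal.
    if concordant > discordant:
        return "+"
    if discordant > concordant:
        return "-"
    return "?"


# Hypothetical markers: genetic position decreases along the scaffold,
# which suggests the scaffold is inverted relative to the map.
flipped = marker_orientation([(1_000, 12.4), (50_000, 9.8), (90_000, 3.1)])
```

A scaffold with a single marker, or with all markers at the same map position, cannot be oriented this way, which is why such cases return "?".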
In one case only, a scaffold from the draft male assembly was flipped
and misplaced during the earlier Chicago HiRise step. Based on linkage
map information, ALLMAPS joined three scaffolds to make the neo-X in the
female assembly, and four scaffolds to make the neo-X in the male
assembly. This made the neo-X scaffold the largest scaffold in both
final assemblies. The LINKS scaffolding step made only two and six
joins, the RAILS step made eight and eleven joins while also filling in
18% and 9% of the existing gaps within scaffolds, and ABySS-Sealer
filled in 38% and 47% of the remaining gaps of the female and male
genomes, respectively. We then identified and removed contaminant
scaffolds with PhylOligo. The contaminant scaffolds from the female
and male assemblies were most similar to Serratia spp. and
Acinetobacter spp., respectively. Both of these genera in the
Gammaproteobacteria have been found in the bark beetle gut bacteriome
(Hernández-García et al., 2017). The final assemblies showed good
consistency between sexes in both shared synteny and chromosomal
arrangement (Supp. Fig. 4), and contained 95% of the 1367 genes in the
Insecta orthologous gene set (Insecta_odb10, creation date: 2020-09-10;
Supp. Fig. 5).
To annotate the genome, we combined evidence from coleopteran proteins
and Dendroctonus spp. transcripts with ab initio gene prediction in
three rounds of Maker3. We identified 13 393 and
13 601 gene models in the female and male genomes, respectively. This
represents approximately a 4% increase from the original draft genome
annotations. These gene models contained 91% of the Insecta orthologous
gene set (Supp. Fig. 5) and 74% shared significant homology to proteins
in the UniProtKB/Swiss-Prot 2020_01 database. Repetitive elements
occupied approximately 23% and 20% of the female and male genome
assemblies, respectively (Supp. Tab. 2).