Fig. 2 Structure of channel conversion module.
The channel conversion module has aggregation and excitation stages. Global feature information of spatial dimension $H \times W$ is aggregated into a channel descriptor of dimension $1 \times 1 \times C$ by average pooling. The $c$-th element of the vector $A$ is calculated as

$A_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$, (3)

where $H$ and $W$ are the height and width, respectively, of the feature map, and $u_c(i, j)$ is the pixel at position $(i, j)$ in the $c$-th channel. Aggregation thus computes the average feature of each channel of the feature map.
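The aggregation step is ordinary global average pooling over each channel. A minimal NumPy sketch (the (C, H, W) shape convention and the function name are illustrative assumptions, not from the paper):

```python
import numpy as np

def channel_aggregation(U):
    """Eq. (3): average-pool each channel of a (C, H, W) feature map
    into a single scalar, producing the channel descriptor A of shape (C,)."""
    C, H, W = U.shape
    # A_c = (1 / (H * W)) * sum_{i,j} u_c(i, j)
    return U.sum(axis=(1, 2)) / (H * W)

# Small example: channel 0 averages to 1.0, channel 1 to 0.0
U = np.zeros((2, 2, 2))
U[0] = 1.0
A = channel_aggregation(U)
```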
To fully utilize the aggregated feature information, the excitation operation captures the dependencies between channels: the aggregated descriptor learns the inter-channel dependencies through fully connected layers. A weight vector $S$ of dimension $1 \times 1 \times C$ is obtained via the sigmoid function and characterizes the importance of each channel. The weight vector is multiplied by the original feature map to obtain the reassigned feature map, which enhances important information and weakens minor information. The dependency between the channels is

$S = F(A) = \sigma(W_2 \, \delta(W_1 A))$, (4)

where $F(\cdot)$ is the calculation of channel weights, $W_1$ and $W_2$ are the weight matrices of the two fully connected layers, $\sigma$ is the sigmoid function, and $\delta$ is the ReLU function. The channel information of the feature map is recalibrated as

$\tilde{X} = S \otimes X$, (5)

where $\tilde{X}$ is the feature map after reassigning channel weights, and $\otimes$ is channel-wise multiplication between the weight vector $S$ and the feature map $X$.
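The excitation and recalibration steps form a squeeze-and-excitation-style reweighting. A hedged NumPy sketch, assuming a reduction-ratio bottleneck between the two fully connected layers (the ratio, weight shapes, and names are illustrative, not specified by the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def excitation(A, W1, W2):
    """Eq. (4): S = sigmoid(W2 @ relu(W1 @ A)).
    A:  channel descriptor, shape (C,).
    W1: first FC layer weights, shape (C // r, C), r = reduction ratio.
    W2: second FC layer weights, shape (C, C // r)."""
    hidden = np.maximum(W1 @ A, 0.0)   # delta: ReLU
    return sigmoid(W2 @ hidden)        # sigma: sigmoid

def recalibrate(X, S):
    """Eq. (5): channel-wise multiplication of the feature map X (C, H, W)
    by the weight vector S (C,), broadcasting S over spatial positions."""
    return X * S[:, None, None]

# Example with C = 4, r = 2 and random weights
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8, 8))
A = X.mean(axis=(1, 2))                          # aggregation, Eq. (3)
W1 = rng.standard_normal((2, 4))
W2 = rng.standard_normal((4, 2))
X_tilde = recalibrate(X, excitation(A, W1, W2))  # reassigned feature map
```

Each channel of `X_tilde` is the corresponding channel of `X` scaled by a learned importance weight in (0, 1).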

C. Global Regression Module

The ResNet50 module produces feature maps with different resolutions. High-resolution, low-level feature maps contain less semantic information but rich spatial detail, while low-resolution, high-level feature maps contain rich semantic information but less spatial detail. To fully exploit the feature information of different dimensions, the low- and high-resolution feature maps are combined by vertical and horizontal paths. The vertical path obtains a high-resolution feature map by upsampling the spatially low-resolution feature map. Then, 1 × 1 convolution is used to reduce the number of channels in the low-level feature map, so as to obtain a feature map with the same dimension as the corresponding vertical-path feature map. The horizontal path fuses the two feature maps (Fig. 3). This pyramidal structure allows feature maps of different resolutions to contain more semantic information, enabling the network to learn richer feature information.
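The fusion of one pyramid level can be sketched in NumPy as follows. Nearest-neighbor 2× upsampling, additive fusion, and the weight shapes are assumptions for illustration; the paper does not specify the upsampling method or fusion operator:

```python
import numpy as np

def upsample2x(F):
    """Nearest-neighbor 2x spatial upsampling of a (C, H, W) feature map."""
    return F.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(F, W):
    """1x1 convolution as a per-pixel channel projection.
    F: (C_in, H, W); W: (C_out, C_in). Returns (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', W, F)

def fuse_level(low_level, high_level, W):
    """Vertical path: upsample the low-resolution, high-level map.
    Lateral 1x1 conv: project the high-resolution, low-level map to the
    same number of channels. Horizontal path: fuse by element-wise addition."""
    top_down = upsample2x(high_level)
    lateral = conv1x1(low_level, W)
    return top_down + lateral

# Example: fuse an 8-channel 4x4 low-level map with a 4-channel 2x2 high-level map
low = np.ones((8, 4, 4))
high = np.ones((4, 2, 2))
W = np.ones((4, 8))
fused = fuse_level(low, high, W)   # shape (4, 4, 4)
```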