A. Multiscale Feature Fusion Network
Gesture images usually contain complex, fine-grained detail, and the fingers and joints are strongly correlated. Relying on a single-scale feature for hand pose estimation therefore tends to discard diverse feature information, which makes it difficult to extract gesture information accurately. Fig. 1 shows the proposed MS-FF, which estimates the hand pose from a single RGB image. Feature maps of different resolutions are extracted from the RGB image by the ResNet50 module. These feature maps are fed into the channel conversion module, which explicitly learns the dependencies between channels so as to enhance important information and suppress minor information. Because the level of feature information depends on the resolution of a feature map, the global regression module obtains high-resolution feature maps containing more semantic information, and these are input separately to the local optimization module to extract deeper information. The Gaussian heatmap of the hand joints () is obtained to improve the spatial generalization ability of the model and thus yield more accurate joint locations. From the lowest-resolution feature map produced by the channel conversion module, the handedness () and the relative depth between the wrist joints () are obtained. The above results are combined to estimate the hand pose,
, (1)
, (2)
where equation (2) represents the result of gesture estimation, and and are the camera inverse projection and inverse affine transformation, respectively.
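As an illustration of the final combination step, the following minimal sketch decodes a joint from a Gaussian heatmap, maps it back to the original image with an inverse affine transformation, and lifts it to camera coordinates with a pinhole back-projection. It is not the formulation of equations (1) and (2). The soft-argmax decoding, the 2x3 crop matrix, the intrinsics (fx, fy, cx, cy), and all function names are assumptions introduced for illustration, and the handedness and relative wrist depth outputs are omitted.

```python
import math


def gaussian_heatmap(cx, cy, width, height, sigma=1.0):
    """Render a 2D Gaussian centred on (cx, cy) as a list of rows."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def soft_argmax(heatmap):
    """Expected (x, y) position under the heatmap normalised to sum to 1.

    Assumption: the source does not state how coordinates are decoded.
    """
    total = sum(sum(row) for row in heatmap)
    x = sum(v * j for row in heatmap for j, v in enumerate(row)) / total
    y = sum(v * i for i, row in enumerate(heatmap) for v in row) / total
    return x, y


def invert_affine(affine):
    """Invert a 2x3 affine matrix [[a, b, tx], [c, d, ty]]."""
    (a, b, tx), (c, d, ty) = affine
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine transformation is not invertible")
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, id_, -(ic * tx + id_ * ty)]]


def apply_affine(affine, x, y):
    """Apply a 2x3 affine matrix to the point (x, y)."""
    (a, b, tx), (c, d, ty) = affine
    return a * x + b * y + tx, c * x + d * y + ty


def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth."""
    return (u - cx) * depth / fx, (v - cy) * depth / fy, depth


# Hypothetical values: crop matrix maps image pixels to crop pixels.
crop_affine = [[0.5, 0.0, -10.0], [0.0, 0.5, -20.0]]
heatmap = gaussian_heatmap(12.0, 8.0, width=32, height=32, sigma=1.5)

x_crop, y_crop = soft_argmax(heatmap)                            # crop space
u, v = apply_affine(invert_affine(crop_affine), x_crop, y_crop)  # image space
joint_xyz = back_project(u, v, depth=500.0,
                         fx=1000.0, fy=1000.0, cx=64.0, cy=56.0)
```

The order of operations matters in this sketch: the inverse affine transformation is applied first so that the pixel coordinates are expressed in the original image, where the camera intrinsics are valid, before back-projection.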