Table 1 compares different methods on RHD, where EPE denotes the average error of the hand joints, and GT H and GT S indicate that the ground-truth handedness and scale of the hand, respectively, are used at test time. Spurr et al. [7] and Yang et al. [9] achieve lower joint errors by relying on this additional input, whereas our method obtains low errors without any ground-truth information during testing.
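As a point of reference, EPE here is the mean Euclidean distance between predicted and ground-truth joint positions averaged over all joints and test samples. The following is a minimal sketch of that computation; the array shapes, the 21-joint layout, and millimetre units are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def mean_epe(pred_joints, gt_joints):
    """Mean end-point error (EPE).

    pred_joints, gt_joints: arrays of shape (N, 21, 3), i.e. N samples,
    21 hand joints, 3D coordinates (assumed to be in millimetres).
    Returns the Euclidean error averaged over all joints and samples.
    """
    per_joint = np.linalg.norm(pred_joints - gt_joints, axis=-1)  # (N, 21)
    return float(per_joint.mean())
```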
Conclusion: We proposed MS-FF for monocular hand pose estimation. The network extracts information at different levels from feature maps of different resolutions, so that the detailed information of occluded edges and fingertips is processed effectively and hand poses are estimated more accurately. A channel conversion module adjusts the channel weights, and a global regression module fuses feature maps of different resolutions to make full use of both the edge detail features of the image and the deep semantic information. An optimization procedure corrects joints that are not regressed to the correct positions. The proposed method achieved higher accuracy and robustness, and the experiments verified the effectiveness of MS-FF.
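The internals of the channel conversion module are not detailed in this section; as one hedged illustration of adjusting per-channel weights on a feature map, the sketch below uses a squeeze-and-excitation style block. The class name, reduction ratio, and overall structure are assumptions for illustration, not the authors' actual module.

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """Illustrative channel-reweighting block (squeeze-and-excitation style):
    global average pooling summarises each channel, a small MLP predicts
    per-channel weights in (0, 1), and the feature map is scaled by them."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                      # (B, C) channel statistics
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1) weights
        return x * w                                # reweighted feature map
```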
Acknowledgments: This work was supported by the National Natural Science Foundation of China under Grant 61601213, and the Special Innovative Projects of General Universities in Guangdong Province under Grant 022WTSCX210.
Zhi Zhan (Guangdong Engineering Polytechnic, China)
Guang Luo (South China Normal University, China)
E-mail: luoguang_arts@163.com