Scene classification presents significant challenges in computer vision and is crucial for advances in robotics and automation. These challenges include intra-class variation, varying object sizes, and diverse spatial information. To address these issues, this paper proposes MFP-CNN, a Multi-Scale Fusion and Pooling Convolutional Neural Network. We introduce a Lightweight Multi-Stage Feature Fusion (LMSFF) method that effectively captures multi-scale information and enhances feature discriminability, mitigating intra-class diversity. We incorporate Spatial Pyramid Pooling (SPP) to improve scale invariance by aggregating features at multiple spatial scales, and we apply the Squeeze-and-Excitation (SE) attention mechanism to focus on informative regions, improving feature extraction and accuracy. We also introduce the Scene7 dataset, which contains diverse real-world scenes annotated for both scene classification and object detection. Extensive experiments on custom and public datasets validate the effectiveness of MFP-CNN, which achieves a top-1 accuracy of 97.4% on Scene7, surpassing state-of-the-art methods. Notably, the model attains this with a parameter-efficient design of only 2.13 MB of trained parameters. MFP-CNN thus offers an efficient and impactful solution for scene classification, advancing robot vision applications. The source code and dataset are available at https://github.com/Wanelle/MFP-CNN.
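To make the SE attention step concrete, the following is a minimal NumPy sketch of the standard Squeeze-and-Excitation channel reweighting (squeeze by global average pooling, excite with a bottleneck MLP, then scale each channel). All weight shapes and variable names here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def se_block(feature_map, w1, b1, w2, b2):
    """Squeeze-and-Excitation reweighting for a feature map of shape (C, H, W).

    w1: (C//r, C) and w2: (C, C//r) are bottleneck weights with a
    reduction ratio r (hypothetical shapes for illustration).
    """
    # Squeeze: global average pooling gives one descriptor per channel
    z = feature_map.mean(axis=(1, 2))                 # shape (C,)
    # Excitation: bottleneck MLP, ReLU then sigmoid gating in (0, 1)
    s = np.maximum(w1 @ z + b1, 0.0)                  # shape (C//r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))          # shape (C,)
    # Scale: reweight each channel by its learned importance
    return feature_map * s[:, None, None]

# Example with random weights (reduction ratio r = 2, C = 8 channels)
rng = np.random.default_rng(0)
c, r = 8, 2
fm = rng.standard_normal((c, 4, 4))
w1, b1 = rng.standard_normal((c // r, c)), np.zeros(c // r)
w2, b2 = rng.standard_normal((c, c // r)), np.zeros(c)
out = se_block(fm, w1, b1, w2, b2)
```

Because the sigmoid gate lies strictly in (0, 1), the block can only attenuate channels, never amplify them; in a trained network the gate learns to suppress uninformative channels while keeping discriminative ones near full strength.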