Le Hong Trang -

Fine-grained object recognition remains challenging because the visual differences between similar categories are subtle. This paper addresses the challenge by improving both feature representation and object localization. Our method introduces a Multi-Classification Module (MCM) and a weakly supervised Multi-Segmentation Module (MSM). The MCM refines feature representations by training each sub-network within the backbone as an independent classifier. The MSM generates object masks from feature maps using a U-Net architecture, which supplies localization information. Both modules can be integrated into a variety of backbone networks. Extensive experiments on several benchmark datasets, including CUB, Stanford Cars, and FGVC-Aircraft, demonstrate the superior performance of our method. We also conduct experiments on surface defect datasets, including Ball Screw and NEU-DET, to show the potential of our approach in machine vision applications.
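
To make the idea behind the MCM concrete, the following is a minimal sketch of supervising several backbone stages with the same class label. It is an illustration under stated assumptions, not the implementation used in this paper: the function names `softmax_cross_entropy` and `multi_classifier_loss`, the equal weighting of the stages, and the example logits are all hypothetical, and the classification heads that would produce the per-stage logits are omitted.

```python
import math
from typing import Sequence


def softmax_cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy of one sample from raw class scores (numerically stable)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def multi_classifier_loss(stage_logits: Sequence[Sequence[float]], label: int) -> float:
    """Sum the per-stage losses so that every stage is supervised by the same label.

    Assumption: all stages are weighted equally; the actual method may differ.
    """
    return sum(softmax_cross_entropy(logits, label) for logits in stage_logits)


# Hypothetical logits from three backbone stages for a 3-class problem.
stage_logits = [
    [0.2, 0.1, 0.0],   # shallow stage: weakly discriminative
    [1.0, 0.3, -0.5],  # middle stage
    [3.0, 0.1, -1.0],  # deepest stage: most confident
]
loss = multi_classifier_loss(stage_logits, label=0)
print(f"total multi-classifier loss: {loss:.4f}")
```

Because each stage contributes its own loss term, the gradient reaches the shallow layers directly rather than only through the final classifier, which is one plausible reading of how per-stage supervision could refine intermediate features.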