Fine-grained image recognition aims to classify subcategories by exploiting detailed features, and it remains a challenging problem in computer vision because the differences between subcategories are small. Most existing work extracts discriminative features by reusing the backbone network or by employing an RPN (Region Proposal Network); these operations undoubtedly increase the complexity of the model. In recent years, the Transformer has shown satisfactory performance on vision tasks. The Transformer decomposes an input image into patches of equal size and classifies the image by repeatedly computing attention scores between patches and weighting them accordingly. In this paper, we propose the PIFM (Patch Impact Factor Module), which takes inspiration from SENet. Specifically, a weight is computed for each patch produced by every Transformer layer, and the patches are then fused according to these weights. The computed weight represents the importance of a patch and indicates the factor with which the network should fuse it. To verify the effectiveness of our method, we conducted experiments on the CUB-200-2011 and Stanford Dogs datasets.
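The abstract describes an SE-style reweighting of patch tokens: squeeze each patch to a scalar, excite through a small bottleneck, and scale the patches by the resulting importance factors before fusion. The following is a minimal NumPy sketch of that general idea, not the paper's actual PIFM; the function names, the bottleneck MLP (`w1`, `w2`), and all shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def patch_impact_weights(patches, w1, w2):
    """SE-style importance weights for a sequence of patch tokens.

    patches: (num_patches, dim) array of patch tokens from one layer.
    w1, w2: parameters of a hypothetical bottleneck MLP.
    Returns one scalar weight in (0, 1) per patch.
    """
    squeezed = patches.mean(axis=1)           # "squeeze": (num_patches,)
    hidden = np.maximum(0.0, w1 @ squeezed)   # "excite": ReLU bottleneck
    scores = w2 @ hidden                      # back to (num_patches,)
    return sigmoid(scores)

def fuse_patches(patches, weights):
    # Scale each patch token by its importance factor before fusion.
    return patches * weights[:, None]

rng = np.random.default_rng(0)
num_patches, dim, bottleneck = 8, 16, 4
patches = rng.standard_normal((num_patches, dim))
w1 = rng.standard_normal((bottleneck, num_patches))
w2 = rng.standard_normal((num_patches, bottleneck))

weights = patch_impact_weights(patches, w1, w2)
fused = fuse_patches(patches, weights)
print(weights.shape, fused.shape)  # (8,) (8, 16)
```

Analogous to SENet's channel attention, the sigmoid keeps each factor in (0, 1), so a patch is attenuated rather than amplified without bound; in the paper's setting this operation would be applied to the patch tokens emitted by each Transformer layer.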