Abstract: The convolution operation in a convolutional neural network captures only local information, whereas the Transformer retains more spatial information and can model long-range dependencies in images. In vision applications, however, the Transformer lacks the ability to adapt flexibly to image size and feature scale. To address these problems, a hierarchical network is used to make modeling at different scales more flexible, and a multi-scale feature fusion module is introduced to enrich feature information. This paper proposes Swin Face, an improved model based on the Swin Transformer. The model uses the Swin Transformer as its backbone network and introduces a multi-level feature fusion module to strengthen its feature representation of human faces. A face recognition classifier is designed using a joint loss function optimization strategy. Experimental results show that, compared with various face recognition methods, Swin Face with its hierarchical feature fusion network achieves the best results on the LFW, CALFW, AgeDB-30, and CFP datasets, and also shows good generalization and robustness.
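As a rough illustration of fusing features from several levels of a hierarchical backbone, the following sketch pools the output of each stage into a vector, concatenates the vectors, and L2-normalizes the result into a single face embedding. The abstract does not specify the fusion operator, so the pooling, concatenation, and normalization used here are assumptions, and the names `global_average_pool`, `l2_normalize`, and `fuse_multi_level` are hypothetical. It is not the Swin Face implementation.

```python
import math
from typing import List, Sequence

# A feature map is stored as tokens x channels: a sequence of token vectors.
FeatureMap = Sequence[Sequence[float]]


def global_average_pool(feature_map: FeatureMap) -> List[float]:
    """Average over tokens, giving one value per channel."""
    num_tokens = len(feature_map)
    num_channels = len(feature_map[0])
    return [
        sum(token[c] for token in feature_map) / num_tokens
        for c in range(num_channels)
    ]


def l2_normalize(vector: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit Euclidean length (eps guards against zero norm)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


def fuse_multi_level(stage_outputs: Sequence[FeatureMap]) -> List[float]:
    """Assumed fusion rule: pool each stage, concatenate, then normalize."""
    fused: List[float] = []
    for feature_map in stage_outputs:
        fused.extend(global_average_pool(feature_map))
    return l2_normalize(fused)


# Toy hierarchy: fewer tokens and more channels at each deeper stage.
stages = [
    [[1.0, 2.0]] * 16,           # stage 1: 16 tokens, 2 channels
    [[0.5, 0.5, 1.0, 1.0]] * 4,  # stage 2: 4 tokens, 4 channels
    [[0.1] * 8],                 # stage 3: 1 token, 8 channels
]
embedding = fuse_multi_level(stages)
print(len(embedding))  # one entry per channel across all stages
```

Concatenation keeps the shallow, fine-grained channels alongside the deep, semantic ones, which is one simple way to enrich the representation. A learned projection or attention-based fusion would be an equally plausible reading of the abstract.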