* ShunYi

The Transformer architecture has driven breakthroughs in natural language processing, and pre-trained models such as BERT and GPT have demonstrated excellent performance. To address the limited generalization ability of statistical G2P models in the low-resource Mongolian setting, this study proposes an end-to-end Mongolian grapheme-to-phoneme (G2P) conversion model based on the Transformer architecture. Through the hierarchical representation learning of the multi-head self-attention mechanism, the model effectively captures contextual dependencies among Mongolian characters, improving conversion robustness. To mitigate the scarcity of annotated Mongolian data, this study constructs a grapheme-phoneme aligned corpus of 25,000 entries. Experiments show that the model reduces the word error rate (WER) by 5.6% relative to the baseline model (Sequitur G2P). Further hyperparameter analysis reveals that jointly tuning the intermediate dimension of the feed-forward network and the number of attention heads has a significant impact on model performance. This study makes contributions in the following three aspects: (1) it is the first to apply the Transformer architecture to the Mongolian G2P task; (2) it constructs a Mongolian grapheme-phoneme aligned corpus, providing data support for low-resource Mongolian language research; and (3) it systematically evaluates how model hyperparameters affect performance, providing an experimental benchmark for follow-up research.
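The abstract reports WER as the evaluation metric. As background, a minimal sketch of how such an error rate is typically computed over G2P outputs, via Levenshtein edit distance normalized by reference length (the function names and the sequence-level formulation are illustrative assumptions, not taken from the paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (or strings)."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions
    for j in range(n + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[m][n]

def error_rate(references, hypotheses):
    """Total edits over total reference tokens, across the test set."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total
```

For G2P, each reference/hypothesis would be a phoneme sequence; a reported reduction such as 5.6% is then the difference between the baseline's and the proposed model's rates on the same test set.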