Chenjing Sun

Speech emotion recognition plays an important role in many applications, but the task is challenging due to factors such as background noise and differences in speech characteristics across speakers. The well-known speech emotion recognition system ACRNN uses a CNN to extract local features of speech signals and an attention mechanism to focus on the parts with prominent emotions. However, it cannot capture long-term global information, and because it uses only a single attention module, it cannot jointly attend to information from different representation subspaces at different positions. To address these drawbacks of ACRNN, this letter proposes CoRNN, which replaces the CNN and attention modules with a Conformer. Experimental results on the IEMOCAP dataset show that the proposed CoRNN achieves an unweighted average recall of 65.53%, an improvement of 0.79% over ACRNN.
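
For readers unfamiliar with the reported metric, unweighted average recall (UAR) is the mean of the per-class recalls, so every emotion class counts equally regardless of how many utterances it contains. The following is a minimal sketch of that definition only; the function name `unweighted_average_recall` and the toy labels are illustrative assumptions and are not taken from the CoRNN or ACRNN implementations.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recall, weighting every class equally.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of reference samples per class
    correct = defaultdict(int)  # number of correctly predicted samples per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall of a class = correct predictions / reference samples of that class.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (labels are made up for illustration).
y_true = ["angry", "angry", "angry", "angry", "happy", "sad"]
y_pred = ["angry", "angry", "angry", "angry", "sad", "sad"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.4f}, accuracy = {accuracy:.4f}")
```

In this toy case the per-class recalls are 1.0, 0.0 and 1.0, so the UAR is 2/3 while plain accuracy is 5/6. The gap shows why UAR is often preferred when emotion classes are imbalanced: a class that is never recognized pulls the score down even if it has few samples.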