Manikanta Kadamba

Recent work in speech emotion recognition (SER) has increasingly focused on improving a computer's ability to understand human emotions from the voice. This paper proposes a new SER algorithm that simplifies the process by working directly with raw speech data rather than depending on acoustic features hand-picked by human experts. The proposed method is a Conformer-CNN model: one part derives from the Conformer block, which is adept at learning long-term temporal features, and the other from the CNN, which excels at capturing localized emotional features from audio inputs. The Conformer block encodes the temporal dynamics of speech and retains contextual information across the spectrogram, while the CNN layers are tailored to extracting fine-grained emotional patterns from the spectrogram's spatial domain. Combined, these components capture both fine-grained and coarse-grained emotion patterns, enabling a deeper analysis of speech data. This combination allows the model to reliably capture the complex interactions between human emotions and the non-stationary characteristics of speech. When evaluated on a multilingual dataset, the algorithm demonstrated improvements in accuracy and interpretability, as well as in capturing temporal and affective context. This research benefits the broader field of human-computer interaction by enabling machines to respond to human emotions more appropriately and accurately.
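As a rough illustration of the architecture described above, the sketch below pairs a convolutional front end over spectrograms with a Conformer encoder, using PyTorch and torchaudio. It is a minimal sketch under assumed hyperparameters (layer sizes, head counts, number of emotion classes) and the hypothetical class name ConformerCNN; it is not the authors' implementation.

```python
# Minimal sketch of a Conformer-CNN hybrid for SER. Illustrative only:
# the class name, layer sizes, and hyperparameters are assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn
from torchaudio.models import Conformer


class ConformerCNN(nn.Module):
    def __init__(self, n_mels=80, n_emotions=6, d_model=144):
        super().__init__()
        # CNN front end: captures localized emotional patterns in the
        # spatial (time-frequency) domain of the spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(32 * (n_mels // 4), d_model)
        # Conformer encoder: models long-range temporal context.
        self.conformer = Conformer(
            input_dim=d_model,
            num_heads=4,
            ffn_dim=256,
            num_layers=4,
            depthwise_conv_kernel_size=31,
        )
        self.classifier = nn.Linear(d_model, n_emotions)

    @staticmethod
    def _downsampled(lengths):
        # Frame count after one conv with kernel 3, stride 2, padding 1.
        return (lengths - 1) // 2 + 1

    def forward(self, spec, lengths):
        # spec: (batch, time, n_mels); lengths: valid frames per utterance.
        x = self.cnn(spec.unsqueeze(1).transpose(2, 3))  # (B, C, mel/4, T/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (B, T/4, C*mel/4)
        x = self.proj(x)
        lengths = self._downsampled(self._downsampled(lengths))
        x, _ = self.conformer(x, lengths)
        # Simple mean pooling over time; masked pooling would be more precise.
        return self.classifier(x.mean(dim=1))            # (B, n_emotions)


model = ConformerCNN()
spec = torch.randn(2, 200, 80)        # two utterances, 200 mel frames each
lengths = torch.tensor([200, 160])
logits = model(spec, lengths)         # shape: (2, 6)
```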