The growing availability and variety of conversational data across platforms have sparked rising interest in dynamic emotion recognition. Speech plays a crucial role in establishing a dynamic emotional climate (EC) during peer conversations. In this study, we introduce a novel approach, DeepBispec, which applies deep bispectral processing to conversational speech to extract features for predicting the emotions expressed during a conversation. Feeding bispectrum images into a CNN, DeepBispec extracts deep features, combines them with affect dynamics (AD), and performs EC classification. Three open datasets, i.e., K-EmoCon, IEMOCAP, and SEWA, were used to test and cross-validate DeepBispec on EC arousal/valence level classification. On K-EmoCon, the experimental results show that combining deep features with AD improved DeepBispec's accuracy from 79% to 81.4% for EC arousal and from an average of 76.8% to 77.5% for EC valence. The IEMOCAP dataset shows a similar trend, with average accuracy increasing from 77.2% to 79.6% for arousal and from 65.7% to 73.6% for valence. The results show that the proposed approach outperforms other state-of-the-art approaches, including deep learning architectures such as CNNs and LSTMs, in speech-based emotion recognition. The bispectrum-based features capture the emotional content of the voice signal in a way that conventional deep learning models do not.
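For readers unfamiliar with the bispectrum images that DeepBispec feeds to its CNN, the following Python sketch shows one conventional way to compute a bispectrum magnitude image from a speech signal using the direct (FFT-based) estimate. The FFT size, hop length, windowing, and log compression here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def bispectrum_image(signal, n_fft=128, hop=64):
    """Direct (FFT-based) bispectrum estimate, averaged over frames.

    Returns an (n_fft x n_fft) log-magnitude image of the kind that
    could serve as CNN input. Frame length, hop, and normalization
    are illustrative choices, not DeepBispec's exact settings.
    """
    # Split the signal into overlapping, windowed frames.
    window = np.hanning(n_fft)
    frames = [signal[s:s + n_fft] * window
              for s in range(0, len(signal) - n_fft + 1, hop)]
    if not frames:
        raise ValueError("signal is shorter than one FFT frame")

    acc = np.zeros((n_fft, n_fft), dtype=complex)
    f1 = np.arange(n_fft)[:, None]
    f2 = np.arange(n_fft)[None, :]
    for frame in frames:
        X = np.fft.fft(frame)
        # B(f1, f2) = E[X(f1) X(f2) X*(f1 + f2)], frequency indices mod n_fft.
        acc += X[f1] * X[f2] * np.conj(X[(f1 + f2) % n_fft])
    B = acc / len(frames)

    # Log compression tames the dynamic range before the image is used.
    return np.log1p(np.abs(B))

# Example usage on one second of dummy 16 kHz audio:
img = bispectrum_image(np.random.randn(16000))
print(img.shape)  # (128, 128)
```

Because the bispectrum is a third-order statistic, it preserves phase coupling between frequency pairs that second-order representations such as spectrograms discard, which is the property the abstract points to when it says these features capture emotional content that conventional deep learning pipelines miss.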