Liu Yunxiang

Although speech emotion recognition (SER) has contributed significantly to artificial intelligence, a heterogeneity gap exists between different modalities. Moreover, most cross-corpus SER methods use only the audio modality, and few studies address cross-corpus bimodal SER. Motivated by these problems, we address both issues simultaneously in this work. We take a YouTube dataset as the source data and the Interactive Emotional Dyadic Motion Capture database (IEMOCAP) as the target data. For both source and target data, we use a CNN and a bidirectional long short-term memory network (Bi-LSTM) to extract speech features, and Bidirectional Encoder Representations from Transformers (BERT) followed by a Bi-LSTM to extract text features. We then design a modality-invariance loss to form a common representation space for the two modalities. To deal with the cross-corpus problem, we learn a common subspace of the source and target data by jointly optimizing Linear Discriminant Analysis (LDA), Maximum Mean Discrepancy (MMD), and Graph Embedding (GE). To preserve emotion-discriminative features, we add an emotion-aware center loss. An SVM classifier performs the final emotion classification. Experimental results on IEMOCAP demonstrate that our method is superior to other state-of-the-art cross-corpus and bimodal SER methods.
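
To illustrate the MMD term used to align source and target data, the following is a minimal sketch of a biased empirical estimate of the squared MMD with an RBF kernel. The abstract does not specify the kernel, the estimator, or how MMD is combined with LDA and GE, so the kernel choice, the `gamma` value, the function names, and the toy feature vectors are all assumptions made for illustration only.

```python
import math

def rbf_kernel(a, b, gamma):
    # RBF kernel k(a, b) = exp(-gamma * ||a - b||^2); the kernel choice is an assumption.
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def mean_kernel(xs, ys, gamma):
    # Average kernel value over all pairs (x, y) with x in xs and y in ys.
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(source, target, gamma=1.0):
    # Biased estimate of MMD^2: E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)].
    return (mean_kernel(source, source, gamma)
            + mean_kernel(target, target, gamma)
            - 2.0 * mean_kernel(source, target, gamma))

# Hypothetical 2-D feature vectors standing in for projected corpus features.
source = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
near_target = [[0.1, 0.0], [0.0, 0.2], [-0.2, -0.1]]
far_target = [[3.0, 3.1], [2.9, 3.0], [3.2, 2.8]]

mmd_near = mmd_squared(source, near_target)
mmd_far = mmd_squared(source, far_target)
print("MMD^2 (similar corpora):   ", mmd_near)
print("MMD^2 (dissimilar corpora):", mmd_far)
```

A smaller value indicates that the two feature distributions are closer, which is why minimizing an MMD term encourages the learned subspace to reduce the mismatch between source and target corpora.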