In recent years, multimodal sentiment analysis has become an active research topic at the intersection of natural language processing and computer vision. In applications such as short videos and social media, text, visual, and audio signals jointly convey rich emotional expression. However, efficiently fusing information across modalities remains challenging: existing methods often degrade because inter-modal fusion is insufficient or because every modality's features are weighted equally. To address this issue, this paper proposes MIFSA (Multimodal Information Fusion with Self-Attention), a model built on self-attention mechanisms and information fusion. MIFSA first encodes the extracted feature sequences with an encoder. It then applies a mutual information maximization method to strengthen the correlations between the text and visual modalities and between the text and audio modalities, training the text-visual and text-audio fused representations jointly with unimodal contrastive information. A self-attention mechanism subsequently assigns dynamic weights to the fused text, visual, and audio features, and a multilayer perceptron produces the final sentiment prediction. On the CMU-MOSI dataset, MIFSA improves Mean Absolute Error (MAE) by 2.27% and the Pearson correlation coefficient (Corr) by 1.13% over the best-performing baseline, MSTFN; on CMU-MOSEI, the improvements are 2.42% and 1.84%, respectively. Ablation studies and case analyses further validate the effectiveness of MIFSA for multimodal sentiment analysis tasks.
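The self-attention step described above, which dynamically weights the fused modality features before prediction, can be illustrated with a minimal sketch. This is not the paper's implementation; the feature dimensions, the random toy inputs, and the function names (`softmax`, `self_attention_fusion`) are illustrative assumptions, showing only the general scaled dot-product attention pattern over a stack of modality representations.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_fusion(features):
    """Scaled dot-product self-attention over stacked modality features.

    features: array of shape (num_modalities, dim), e.g. one row each for
    the text, text-visual, and text-audio fused representations.
    Returns attention-weighted features of the same shape.
    """
    d_k = features.shape[-1]
    scores = features @ features.T / np.sqrt(d_k)  # (m, m) cross-modality similarity
    weights = softmax(scores, axis=-1)             # dynamic per-modality weights
    return weights @ features                      # weighted combination

# toy example: three 4-dimensional modality features (hypothetical values)
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
fused = self_attention_fusion(feats)
print(fused.shape)  # (3, 4)
```

In a full model, learned query/key/value projections would precede the dot product, and the fused output would feed the multilayer perceptron that emits the sentiment score.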