Qing Zhao

Since noise significantly degrades speech quality and intelligibility, suppressing it is a central goal of speech enhancement. Conformer networks have gained popularity in this area owing to their strong noise-reduction performance: the Conformer module combines an attention mechanism with a convolutional neural network to capture both long- and short-range dependencies in speech sequences. Building on this, this paper proposes a Dual-Stream Interactive Conformer Network (DSICNet) that uses the Conformer module as its core feature extractor. In the network's core processing, both magnitude and phase information first undergo time-domain feature extraction, followed by frequency-domain feature extraction. Although the attention mechanism effectively captures dependencies within its own input sequence, it tends to overlook features from the other stream. To address this, DSICNet introduces two interaction modules in the enhancement layer and configures them so that magnitude serves as the dominant information; the fused information then guides the generation of features along both paths. To keep feature estimates consistent across the dual-stream paths and to reduce errors caused by uncontrolled learning, a fusion module acts as a preprocessor for the enhancement layer. Experimental results on the VoiceBank + DEMAND dataset show a marked improvement in denoising performance over competing models. Ablation studies further underscore the importance of the interaction module, indicating that setting magnitude as its mainstream information yields better results than using phase as the mainstream information.
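The abstract does not give the internals of the interaction module, so the following is only a rough illustrative sketch of the general idea it describes: a gate derived from the dominant magnitude stream modulates the phase stream before fusion. The function name, the sigmoid gating form, and the additive fusion are all assumptions for illustration, not the paper's actual design.

```python
import math

def sigmoid(x):
    # Standard logistic function used as a soft gate in [0, 1].
    return 1.0 / (1.0 + math.exp(-x))

def interact(mag_feats, phase_feats, gate_weights):
    """Hypothetical cross-stream interaction with magnitude dominant.

    A per-feature gate is computed from the magnitude stream, scales the
    phase stream, and the gated phase features are fused back into the
    magnitude path (additive fusion is an assumption here).
    """
    # Gate values derived from the dominant (magnitude) stream.
    gates = [sigmoid(w * m) for w, m in zip(gate_weights, mag_feats)]
    # Phase features modulated by the magnitude-derived gates.
    gated_phase = [g * p for g, p in zip(gates, phase_feats)]
    # Fuse the modulated phase information back into the magnitude path.
    fused = [m + gp for m, gp in zip(mag_feats, gated_phase)]
    return fused, gated_phase

# Toy example: two feature dimensions.
fused, gated_phase = interact([1.0, 0.0], [2.0, 2.0], [1.0, 1.0])
```

Swapping the roles of the two lists would correspond to the ablation variant in which phase, rather than magnitude, acts as the mainstream information.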