Fan Zhang et al.

Speech-driven gesture generation using transformer-based generative models is a rapidly advancing area within virtual human creation. However, existing models face significant challenges due to their quadratic time and space complexity, which limits scalability and efficiency. To address these limitations, we introduce DiM-Gestor, an innovative end-to-end generative model leveraging the Mamba-2 architecture. DiM-Gestor features a dual-component framework: (1) a fuzzy feature extractor and (2) a speech-to-gesture mapping module, both built on Mamba-2. The fuzzy feature extractor, which integrates a Chinese pre-trained speech model with Mamba-2, autonomously extracts implicit, continuous speech features. These features are fused into a unified latent representation and then processed by the speech-to-gesture mapping module. This module employs an Adaptive Layer Normalization (AdaLN)-enhanced Mamba-2 mechanism that applies the same conditioning transformation across all sequence tokens, enabling precise modeling of the nuanced interplay between speech features and gesture dynamics. A diffusion model is used for training and inference, producing diverse gesture outputs. Extensive subjective and objective evaluations conducted on the newly released Chinese Co-Speech Gestures (CCG) dataset corroborate the efficacy of the proposed model. Compared with a Transformer-based architecture, our approach delivers competitive results while reducing memory usage by approximately 2.4 times and increasing inference speed by 2 to 4 times. Additionally, we release the CCG dataset, comprising 15.97 hours (six styles across five scenarios) of 3D full-body skeleton gesture motion performed by professional Chinese TV broadcasters.
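The AdaLN conditioning described above can be illustrated with a minimal sketch. The block below is a hypothetical illustration, not the authors' code: the names AdaLNBlock, to_mod, and seq_mixer are introduced here, and the sequence mixer is a placeholder standing in for a Mamba-2 layer; in DiM-Gestor the conditioning vector would fuse the extracted speech features with the diffusion timestep.

# Minimal sketch (assumption, not the authors' implementation) of an
# AdaLN-conditioned sequence block: a single conditioning vector produces
# shift/scale/gate terms that modulate every token of the sequence uniformly.
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, d_model: int, seq_mixer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mixer = seq_mixer                      # placeholder for a Mamba-2 layer
        # Project the fused speech/timestep condition into shift, scale, and gate.
        self.to_mod = nn.Linear(d_model, 3 * d_model)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, seq_len, d_model) noisy gesture tokens
        # cond: (batch, d_model)          conditioning vector
        shift, scale, gate = self.to_mod(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift      # same modulation for all tokens
        return x + gate * self.mixer(h)             # gated residual update

# Example: d_model=256 with an identity mixer standing in for Mamba-2
# block = AdaLNBlock(256, nn.Identity())
# y = block(torch.randn(2, 120, 256), torch.randn(2, 256))

Applying one modulation to all tokens keeps the conditioning cost independent of sequence length, which is consistent with the efficiency motivation stated in the abstract.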

Fan Zhang et al.

Speech-driven gesture generation is an emerging field within the domain of virtual human creation. The primary objective is to generate authentic, personalized co-speech gestures from appropriate input conditions. A significant challenge, however, lies in accurately determining the multitude of factors (acoustic, semantic, emotional, personality-related, and even subtle unknown features) inherent in these input conditions, which can be considered a complex class of fuzzy sets. Consequently, relying solely on manually annotated classification labels limits the potential diversity of output states. To address these challenges, we introduce Persona-Gestor, a novel approach integrating an automatic fuzzy feature inference mechanism with a probabilistic diffusion-based non-autoregressive transformer model. The fuzzy feature inference mechanism, embedded within a condition extractor, automatically extracts feature sets solely from raw speech audio. These extracted features are then employed as input conditions for generating personalized 3D full-body gestures. The condition extractor leverages the WavLM large-scale pre-trained model to capture local and global audio information in a unified latent representation associated with gestures. Capturing all pertinent information in this way eliminates the need for manual annotation labels, thereby streamlining multimodal processing. Furthermore, we employ adaptive layer normalization to enhance the modeling of the intricate relationship between speech and gestures. Finally, the learning and synthesis stages are facilitated through a diffusion process, leading to a wide range of gesture-generation outcomes. Extensive subjective and objective evaluations conducted on three high-quality co-speech gesture datasets (Trinity, ZEGGS, and BEAT) demonstrate our method’s superior performance compared to recent approaches.
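As a rough illustration of the condition-extractor idea, the sketch below uses the Hugging Face transformers library and the "microsoft/wavlm-large" checkpoint to turn raw 16 kHz audio into frame-level latents. The layer-averaging pooling and the speech_condition helper are assumptions made here for illustration, not the authors' exact design.

# Minimal sketch (assumption, not the paper's implementation) of extracting a
# unified speech latent from raw audio with a WavLM pre-trained model, to be used
# as the condition of a diffusion-based gesture decoder.
import torch
from transformers import WavLMModel

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

@torch.no_grad()
def speech_condition(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """waveform_16khz: (batch, num_samples) float tensor of raw 16 kHz audio."""
    # Zero-mean / unit-variance normalization of the raw waveform.
    wav = (waveform_16khz - waveform_16khz.mean(dim=-1, keepdim=True)) / (
        waveform_16khz.std(dim=-1, keepdim=True) + 1e-7
    )
    out = wavlm(wav, output_hidden_states=True)
    # Averaging hidden states across layers mixes lower-layer (local) and
    # higher-layer (global) information into one latent per audio frame.
    hidden = torch.stack(out.hidden_states, dim=0).mean(dim=0)  # (batch, frames, 1024)
    return hidden  # frame-level condition for the gesture generator

# Example: 4 seconds of audio at 16 kHz
# cond = speech_condition(torch.randn(1, 4 * 16000))

Because the condition is derived entirely from the audio, no style or emotion labels need to be supplied at inference time, which matches the label-free pipeline the abstract describes.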