Fan Zhang et al.

Speech-driven gesture generation using transformer-based generative models is a rapidly advancing area within virtual human creation. However, existing models face significant challenges due to their quadratic time and space complexity, which limits scalability and efficiency. To address these limitations, we introduce DiM-Gestor, an innovative end-to-end generative model leveraging the Mamba-2 architecture. DiM-Gestor features a dual-component framework: (1) a fuzzy feature extractor and (2) a speech-to-gesture mapping module, both built on Mamba-2. The fuzzy feature extractor, which integrates a Chinese pre-trained speech model with Mamba-2, autonomously extracts implicit, continuous speech features. These features are fused into a unified latent representation and then processed by the speech-to-gesture mapping module. This module employs an Adaptive Layer Normalization (AdaLN)-enhanced Mamba-2 mechanism that applies the same conditioning transformation across all sequence tokens, enabling precise modeling of the nuanced interplay between speech features and gesture dynamics. A diffusion model is used for training and inference, producing diverse gesture outputs. Extensive subjective and objective evaluations conducted on the newly released Chinese Co-Speech Gestures (CCG) dataset corroborate the efficacy of the proposed model. Compared with a Transformer-based architecture, our approach delivers competitive results while reducing memory usage by approximately 2.4 times and increasing inference speed by 2 to 4 times. Additionally, we release the CCG dataset, comprising 15.97 hours (six styles across five scenarios) of 3D full-body skeleton gesture motion performed by professional Chinese TV broadcasters.
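The AdaLN conditioning described above can be illustrated with a minimal sketch. The block below is a hypothetical illustration, not the authors' code: the names AdaLNBlock, to_mod, and seq_mixer are introduced here, and the sequence mixer is a placeholder standing in for a Mamba-2 layer; in DiM-Gestor the conditioning vector would fuse the extracted speech features with the diffusion timestep.

# Minimal sketch (assumption, not the authors' implementation) of an
# AdaLN-conditioned sequence block: a single conditioning vector produces
# shift/scale/gate terms that modulate every token of the sequence uniformly.
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, d_model: int, seq_mixer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mixer = seq_mixer                      # placeholder for a Mamba-2 layer
        # Project the fused speech/timestep condition into shift, scale, and gate.
        self.to_mod = nn.Linear(d_model, 3 * d_model)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, seq_len, d_model) noisy gesture tokens
        # cond: (batch, d_model)          conditioning vector
        shift, scale, gate = self.to_mod(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift      # same modulation for all tokens
        return x + gate * self.mixer(h)             # gated residual update

# Example: d_model=256 with an identity mixer standing in for Mamba-2
# block = AdaLNBlock(256, nn.Identity())
# y = block(torch.randn(2, 120, 256), torch.randn(2, 256))

Applying one modulation to all tokens keeps the conditioning cost independent of sequence length, which is consistent with the efficiency motivation stated in the abstract.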

Fan Zhang et al.

Speech-driven gesture generation is an emerging field within the domain of virtual human creation. The primary objective is to generate authentic, personalized co-speech gestures from appropriate input conditions. A significant challenge, however, lies in accurately determining the multitude of factors (acoustic, semantic, emotional, personality-related, and even subtle unknown features) inherent in these input conditions, which can be considered a complex class of fuzzy sets. Consequently, relying solely on manually annotated classification labels limits the potential diversity of output states. To address these challenges, we introduce Persona-Gestor, a novel approach integrating an automatic fuzzy feature inference mechanism with a probabilistic diffusion-based non-autoregressive transformer model. The fuzzy feature inference mechanism, embedded within a condition extractor, automatically extracts feature sets solely from raw speech audio. These extracted features are then employed as input conditions for generating personalized 3D full-body gestures. The condition extractor leverages the WavLM large-scale pre-trained model to capture local and global audio information in a unified latent representation associated with gestures. Capturing all pertinent information in this way eliminates the need for manual annotation labels, thereby streamlining multimodal processing. Furthermore, we employ adaptive layer normalization to enhance the modeling of the intricate relationship between speech and gestures. Finally, the learning and synthesis stages are facilitated through a diffusion process, leading to a wide range of gesture-generation outcomes. Extensive subjective and objective evaluations conducted on three high-quality co-speech gesture datasets (Trinity, ZEGGS, and BEAT) demonstrate our method’s superior performance compared to recent approaches.
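As a rough illustration of the condition-extractor idea, the sketch below uses the Hugging Face transformers library and the "microsoft/wavlm-large" checkpoint to turn raw 16 kHz audio into frame-level latents. The layer-averaging pooling and the speech_condition helper are assumptions made here for illustration, not the authors' exact design.

# Minimal sketch (assumption, not the paper's implementation) of extracting a
# unified speech latent from raw audio with a WavLM pre-trained model, to be used
# as the condition of a diffusion-based gesture decoder.
import torch
from transformers import WavLMModel

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

@torch.no_grad()
def speech_condition(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """waveform_16khz: (batch, num_samples) float tensor of raw 16 kHz audio."""
    # Zero-mean / unit-variance normalization of the raw waveform.
    wav = (waveform_16khz - waveform_16khz.mean(dim=-1, keepdim=True)) / (
        waveform_16khz.std(dim=-1, keepdim=True) + 1e-7
    )
    out = wavlm(wav, output_hidden_states=True)
    # Averaging hidden states across layers mixes lower-layer (local) and
    # higher-layer (global) information into one latent per audio frame.
    hidden = torch.stack(out.hidden_states, dim=0).mean(dim=0)  # (batch, frames, 1024)
    return hidden  # frame-level condition for the gesture generator

# Example: 4 seconds of audio at 16 kHz
# cond = speech_condition(torch.randn(1, 4 * 16000))

Because the condition is derived entirely from the audio, no style or emotion labels need to be supplied at inference time, which matches the label-free pipeline the abstract describes.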