Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space

Anonymous Authors

Submitted to ICME 2025

Abstract

Audio-driven emotional 3D facial animation faces two significant challenges: (1) reliance on single-modal control signals (videos, text, or emotion labels) without leveraging their complementary strengths for comprehensive emotion manipulation, and (2) deterministic regression-based mappings that cannot capture the stochastic nature of emotional expressions and non-verbal behaviors, which limits the expressiveness of synthesized animations. To address these challenges, we present a diffusion-based framework for controllable expressive 3D facial animation. Our approach introduces two key innovations: (1) a FLAME-centered multimodal emotion binding strategy that aligns diverse modalities (text, audio, and emotion labels) through contrastive learning, enabling flexible emotion control from multiple signal sources, and (2) an attention-based latent diffusion model with content-aware attention and emotion-guided layers, which enriches motion diversity while maintaining temporal coherence and natural facial dynamics. Extensive experiments demonstrate that our method outperforms existing approaches across most metrics, achieving a 21.6% improvement in emotion similarity while preserving physiologically plausible facial dynamics.
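
At a high level, the FLAME-centered binding can be pictured as a contrastive alignment in which audio, text, and emotion-label embeddings are pulled toward FLAME expression embeddings, so that any one of these signals can later condition the animation model. The following is a minimal, illustrative sketch of this idea in PyTorch; the module names, feature dimensions, and the symmetric InfoNCE loss are assumptions made for illustration and are not the exact implementation used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    # Hypothetical projection head mapping a modality feature into the shared space.
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, emb_dim), nn.GELU(), nn.Linear(emb_dim, emb_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embedding

def info_nce(anchor, positive, temperature=0.07):
    # Symmetric InfoNCE: matching (anchor_i, positive_i) pairs in the batch are positives.
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage: random tensors stand in for pretrained audio/text features and FLAME parameters.
B = 8
flame_enc = ModalityEncoder(in_dim=103)   # e.g. FLAME expression + jaw pose coefficients
audio_enc = ModalityEncoder(in_dim=768)   # e.g. features from a speech encoder
text_enc = ModalityEncoder(in_dim=512)    # e.g. features from a text encoder
label_emb = nn.Embedding(8, 256)          # discrete emotion categories

z_flame = flame_enc(torch.randn(B, 103))
z_audio = audio_enc(torch.randn(B, 768))
z_text = text_enc(torch.randn(B, 512))
z_label = F.normalize(label_emb(torch.randint(0, 8, (B,))), dim=-1)

# Bind every control modality to the FLAME embedding, which acts as the shared anchor.
loss = info_nce(z_audio, z_flame) + info_nce(z_text, z_flame) + info_nce(z_label, z_flame)
loss.backward()

In the full framework, an embedding drawn from this shared space would then condition the emotion-guided layers of the latent diffusion model, which is what allows text, audio, or a discrete emotion label to drive the same control pathway.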

Method Overview

Framework Overview

Supplementary Video Materials

Introduction Video in Different Emotions

Comparison of Different Methods (MEAD_emo dataset)

Comparison of Different Methods (HDTF dataset)

Weight Comparison Experiment

Demonstration of the Effects of Different Weights

Comparison of Methods Driven by Different Modalities

Ablation Study on Different Components

Citation

If you find our work useful, please consider citing: