DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentanglement Diffusion

Kangwei Liu, Junwu Liu, Yun Cao, Jinlin Guo, Xiaowei Yi

Accepted by ICME 2025

Abstract

Existing diffusion-based audio-driven talking face generation methods often suffer from poor lip synchronization and unnatural facial expressions, issues that are particularly pronounced in cross-lingual scenarios. These problems stem from two fundamental challenges: the semantic entanglement of intermediate facial representations (e.g., 3DMM parameters) and the isolated processing of spatial-temporal features. To address these challenges, we present a novel framework with two key innovations: (1) a data-driven semantic disentanglement approach that decomposes 3DMM parameters into meaningful subspaces for fine-grained facial region control, and (2) a hierarchical diffusion architecture with region-aware attention that jointly models spatial-temporal features throughout the generation process. We further introduce CHDTF, a Chinese high-definition talking face dataset for cross-lingual evaluation. Extensive experiments demonstrate that our method outperforms existing approaches, achieving superior lip synchronization and more natural expressions while maintaining temporal coherence.
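To make the semantic disentanglement idea concrete, the snippet below is an illustrative sketch only, not the authors' released implementation. It assumes a 3DMM expression vector of dimension `exp_dim` and a set of hypothetical facial-region subspaces (e.g., mouth, eyes, brows); the linear projections here are randomly initialized stand-ins for the learned, data-driven decomposition described in the abstract.

```python
# Illustrative sketch only -- NOT the paper's released code.
# Assumed names: SemanticSubspaceProjector, exp_dim, num_regions, sub_dim.
import torch
import torch.nn as nn


class SemanticSubspaceProjector(nn.Module):
    """Decompose 3DMM expression parameters into per-region semantic subspaces."""

    def __init__(self, exp_dim: int = 64, num_regions: int = 3, sub_dim: int = 16):
        super().__init__()
        # One projection per hypothetical facial region (mouth, eyes, brows, ...).
        self.encoders = nn.ModuleList(
            [nn.Linear(exp_dim, sub_dim) for _ in range(num_regions)]
        )
        # Map the concatenated region codes back to the full parameter space.
        self.decoder = nn.Linear(num_regions * sub_dim, exp_dim)

    def forward(self, exp_params: torch.Tensor):
        # exp_params: (batch, frames, exp_dim)
        codes = [enc(exp_params) for enc in self.encoders]  # per-region codes
        recon = self.decoder(torch.cat(codes, dim=-1))      # reconstructed parameters
        return codes, recon


if __name__ == "__main__":
    model = SemanticSubspaceProjector()
    exp = torch.randn(2, 25, 64)  # 2 clips, 25 frames, 64 expression coefficients
    codes, recon = model(exp)
    print([c.shape for c in codes], recon.shape)
```

In this sketch, each region code can be edited or conditioned independently (fine-grained facial region control) while the decoder keeps the full 3DMM parameter vector consistent; how the actual subspaces are learned follows the paper, not this toy example.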

Method Overview

Framework Overview

Supplementary Video Materials

Introduction Video in Different Languages


Comparison of Different Methods (HDTF dataset)


Comparison of Different Methods (CHDTF dataset)


Comparison of Different Methods (Voxceleb2 dataset, multilingual)


Disentanglement Analysis of 3DMM Parameters


Weight Comparison Experiment


Ablation Study on Different Components


Citation

If you find our work useful, please consider citing: