DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentanglement Diffusion

Kangwei Liu, Junwu Liu, Yun Cao, Jinlin Guo, Xiaowei Yi

Accepted by ICME 2025

Abstract

Existing diffusion-based audio-driven talking face generation methods often suffer from poor lip synchronization and unnatural facial expressions, issues that are particularly pronounced in cross-lingual scenarios. These problems stem from two fundamental challenges: the semantic entanglement of intermediate facial representations (e.g., 3DMM parameters) and the isolated processing of spatial-temporal features. To address these challenges, we present a novel framework with two key innovations: (1) a data-driven semantic disentanglement approach that decomposes 3DMM parameters into meaningful subspaces for fine-grained facial region control, and (2) a hierarchical diffusion architecture with region-aware attention that jointly models spatial-temporal features throughout the generation process. We further introduce CHDTF, a Chinese high-definition talking face dataset for cross-lingual evaluation. Extensive experiments demonstrate that our method outperforms existing approaches, achieving superior lip synchronization and more natural expressions while maintaining temporal coherence.
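To make the semantic disentanglement idea concrete, the snippet below is an illustrative sketch only, not the authors' released implementation. It assumes a 3DMM expression vector of dimension `exp_dim` and a set of hypothetical facial-region subspaces (e.g., mouth, eyes, brows); the linear projections here are randomly initialized stand-ins for the learned, data-driven decomposition described in the abstract.

```python
# Illustrative sketch only -- NOT the paper's released code.
# Assumed names: SemanticSubspaceProjector, exp_dim, num_regions, sub_dim.
import torch
import torch.nn as nn


class SemanticSubspaceProjector(nn.Module):
    """Decompose 3DMM expression parameters into per-region semantic subspaces."""

    def __init__(self, exp_dim: int = 64, num_regions: int = 3, sub_dim: int = 16):
        super().__init__()
        # One projection per hypothetical facial region (mouth, eyes, brows, ...).
        self.encoders = nn.ModuleList(
            [nn.Linear(exp_dim, sub_dim) for _ in range(num_regions)]
        )
        # Map the concatenated region codes back to the full parameter space.
        self.decoder = nn.Linear(num_regions * sub_dim, exp_dim)

    def forward(self, exp_params: torch.Tensor):
        # exp_params: (batch, frames, exp_dim)
        codes = [enc(exp_params) for enc in self.encoders]  # per-region codes
        recon = self.decoder(torch.cat(codes, dim=-1))      # reconstructed parameters
        return codes, recon


if __name__ == "__main__":
    model = SemanticSubspaceProjector()
    exp = torch.randn(2, 25, 64)  # 2 clips, 25 frames, 64 expression coefficients
    codes, recon = model(exp)
    print([c.shape for c in codes], recon.shape)
```

In this sketch, each region code can be edited or conditioned independently (fine-grained facial region control) while the decoder keeps the full 3DMM parameter vector consistent; how the actual subspaces are learned follows the paper, not this toy example.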

Method Overview

Framework Overview

Supplementary Video Materials

Introduction Video in Different Languages


Comparison of Different Methods (HDTF dataset)


Comparison of Different Methods (CHDTF dataset)


Comparison of Different Methods (Voxceleb2 dataset, multilingual)


Disentanglement Analysis of 3DMM Parameters


Weight Comparison Experiment


Ablation Study on Different Components


Citation

If you find our work useful, please consider citing: