Existing diffusion-based, audio-driven talking face generation methods often suffer from poor lip synchronization and unnatural facial expressions, and these issues are particularly pronounced in cross-lingual scenarios. They stem from two fundamental challenges: the semantic entanglement of intermediate facial representations (e.g., 3DMM parameters) and the isolated processing of spatial-temporal features. To address these challenges, we present a novel framework with two key innovations: (1) a data-driven semantic disentanglement approach that decomposes 3DMM parameters into meaningful subspaces for fine-grained control of facial regions, and (2) a hierarchical diffusion architecture with region-aware attention that jointly models spatial-temporal features throughout generation. We further introduce CHDTF, a Chinese high-definition talking face dataset for cross-lingual evaluation. Extensive experiments demonstrate that our method outperforms existing approaches, achieving superior lip synchronization and more natural expressions while maintaining temporal coherence.
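
The sketch below is a minimal illustration (not the paper's implementation) of the two ideas summarized above: splitting a flat 3DMM parameter vector into per-region semantic subspaces, and letting each region cross-attend to the audio feature sequence with its own attention head. All module names, dimensions, and the region list (`lips`, `eyes`, `brows`, `jaw`) are hypothetical placeholders chosen for the example.

```python
# Minimal sketch, assuming a 257-dim BFM-style 3DMM vector and frame-aligned
# audio features; module names and region list are hypothetical.
import torch
import torch.nn as nn

REGIONS = ["lips", "eyes", "brows", "jaw"]  # assumed facial regions

class SemanticDisentangler(nn.Module):
    """Project a flat 3DMM parameter vector into per-region subspaces."""
    def __init__(self, param_dim=257, sub_dim=32):
        super().__init__()
        self.heads = nn.ModuleDict(
            {r: nn.Linear(param_dim, sub_dim) for r in REGIONS}
        )

    def forward(self, params):                       # params: (B, T, param_dim)
        return {r: head(params) for r, head in self.heads.items()}

class RegionAwareAttention(nn.Module):
    """Each region's subspace queries the audio sequence independently."""
    def __init__(self, sub_dim=32, audio_dim=128, n_heads=4):
        super().__init__()
        self.proj_audio = nn.Linear(audio_dim, sub_dim)
        self.attn = nn.ModuleDict(
            {r: nn.MultiheadAttention(sub_dim, n_heads, batch_first=True)
             for r in REGIONS}
        )

    def forward(self, region_feats, audio):           # audio: (B, T, audio_dim)
        a = self.proj_audio(audio)
        out = {}
        for r, q in region_feats.items():             # q: (B, T, sub_dim)
            out[r], _ = self.attn[r](q, a, a)         # region queries audio
        return out

if __name__ == "__main__":
    B, T = 2, 25
    params = torch.randn(B, T, 257)                   # 3DMM parameter sequence
    audio = torch.randn(B, T, 128)                    # frame-aligned audio features
    feats = SemanticDisentangler()(params)
    fused = RegionAwareAttention()(feats, audio)
    print({r: v.shape for r, v in fused.items()})
```

In the actual framework the per-region features would condition the hierarchical diffusion model; here the example stops at the region-audio fusion step for brevity.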
If you find our work useful, please consider citing: