TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
August 19, 2025
Authors: Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang
cs.AI
Abstract
Audio-driven talking head synthesis has achieved remarkable photorealism, yet
state-of-the-art (SOTA) models exhibit a critical failure: they lack
generalization to the full spectrum of human diversity in ethnicity, language,
and age groups. We argue that this generalization gap is a direct symptom of
limitations in existing training data, which lack the necessary scale, quality,
and diversity. To address this challenge, we introduce TalkVid, a new
large-scale, high-quality, and diverse dataset containing 1244 hours of video
from 7729 unique speakers. TalkVid is curated through a principled, multi-stage
automated pipeline that rigorously filters for motion stability, aesthetic
quality, and facial detail, and is validated against human judgments to ensure
its reliability. Furthermore, we construct and release TalkVid-Bench, a
stratified evaluation set of 500 clips meticulously balanced across key
demographic and linguistic axes. Our experiments demonstrate that a model
trained on TalkVid outperforms counterparts trained on previous datasets,
exhibiting superior cross-dataset generalization. Crucially, our analysis on
TalkVid-Bench reveals performance disparities across subgroups that are
obscured by traditional aggregate metrics, underscoring its necessity for
future research. Code and data are available at
https://github.com/FreedomIntelligence/TalkVid.