TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

August 19, 2025
Authors: Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang
cs.AI

Abstract

Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found at https://github.com/FreedomIntelligence/TalkVid.
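
The abstract describes the curation pipeline only at a high level: a staged cascade of filters for motion stability, aesthetic quality, and facial detail. The sketch below is a minimal illustration of what such a filter cascade can look like, not the paper's actual implementation; the ClipScores record, the score definitions, and the thresholds are all hypothetical stand-ins for the scoring models described in the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ClipScores:
    """Per-clip quality scores; the scoring models themselves are not shown here."""
    clip_id: str
    motion_stability: float   # e.g. 1 - normalized motion jitter, higher is steadier
    aesthetic_quality: float  # e.g. output of an aesthetic predictor, higher is better
    face_detail: float        # e.g. face-region sharpness / resolution score


def filter_clips(clips: Iterable[ClipScores],
                 min_stability: float = 0.8,
                 min_aesthetic: float = 0.5,
                 min_face_detail: float = 0.6) -> List[ClipScores]:
    """Keep only clips that pass every stage; a clip rejected at an earlier
    stage never reaches the later (typically more expensive) checks."""
    kept = []
    for clip in clips:
        if clip.motion_stability < min_stability:
            continue  # stage 1: discard shaky or fast-cutting footage
        if clip.aesthetic_quality < min_aesthetic:
            continue  # stage 2: discard low-aesthetic-quality clips
        if clip.face_detail < min_face_detail:
            continue  # stage 3: discard blurry or low-resolution faces
        kept.append(clip)
    return kept


if __name__ == "__main__":
    demo = [
        ClipScores("a", 0.95, 0.7, 0.8),
        ClipScores("b", 0.60, 0.9, 0.9),  # fails the motion-stability stage
    ]
    print([c.clip_id for c in filter_clips(demo)])  # -> ['a']
```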
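The abstract's final analytic claim is that a single aggregate score can hide performance gaps between demographic or linguistic subgroups, which is why TalkVid-Bench is stratified. As a minimal illustration of that point, not the paper's evaluation code, the sketch below contrasts one pooled mean with a per-subgroup breakdown of the same records; the group labels and scores are invented.

```python
from collections import defaultdict
from statistics import mean


def aggregate_and_subgroup_means(records):
    """records: iterable of dicts like {"group": "age_60+", "score": 0.71}.
    Returns the overall mean plus per-subgroup means, so gaps hidden by the
    single aggregate number become visible."""
    overall = mean(r["score"] for r in records)
    by_group = defaultdict(list)
    for r in records:
        by_group[r["group"]].append(r["score"])
    return overall, {g: mean(v) for g, v in sorted(by_group.items())}


if __name__ == "__main__":
    demo = [
        {"group": "language_en", "score": 0.82},
        {"group": "language_en", "score": 0.80},
        {"group": "language_th", "score": 0.64},
    ]
    overall, per_group = aggregate_and_subgroup_means(demo)
    print(overall)    # one aggregate number looks acceptable
    print(per_group)  # the breakdown exposes the weaker subgroup
```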