TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

Abstract

Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they generalize poorly across the full spectrum of human diversity in ethnicity, language, and age. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and the pipeline is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets and exhibits superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that traditional aggregate metrics obscure, underscoring the need for such a benchmark in future research. Code and data are available at https://github.com/FreedomIntelligence/TalkVid
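To make the idea of a stratified, balanced evaluation set concrete, the following is a minimal sketch of drawing the same number of clips from each subgroup. It is not the authors' procedure: the abstract does not specify how TalkVid-Bench was sampled, and the function name `stratified_sample`, the metadata fields `id` and `language`, and the toy clip pool are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(clips, key, per_group, seed=0):
    """Draw `per_group` clips from every subgroup defined by `clip[key]`.

    Hypothetical helper for illustration; not part of the TalkVid code base.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    groups = defaultdict(list)
    for clip in clips:
        groups[clip[key]].append(clip)

    sample = []
    for name in sorted(groups):  # sorted for a deterministic group order
        members = groups[name]
        if len(members) < per_group:
            raise ValueError(f"subgroup {name!r} has only {len(members)} clips")
        sample.extend(rng.sample(members, per_group))
    return sample


# Toy, deliberately imbalanced pool (made-up metadata, not TalkVid data).
pool = (
    [{"id": f"en-{i}", "language": "English"} for i in range(50)]
    + [{"id": f"ja-{i}", "language": "Japanese"} for i in range(8)]
    + [{"id": f"ar-{i}", "language": "Arabic"} for i in range(5)]
)

balanced = stratified_sample(pool, key="language", per_group=5)
counts = Counter(clip["language"] for clip in balanced)
print(dict(counts))
```

Because every subgroup contributes equally, a metric averaged over such a set is not dominated by the largest subgroup, and reporting it per subgroup makes disparities visible that a single pooled average would hide.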
