TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

Abstract

Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they fail to generalize across the full spectrum of human diversity in ethnicity, language, and age. We argue that this generalization gap is a direct symptom of the limitations of existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and the pipeline is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets and exhibits superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring the benchmark's necessity for future research. Code and data are available at https://github.com/FreedomIntelligence/TalkVid

