SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

July 14, 2025
Authors: Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
cs.AI

Abstract

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits, covering diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for supervised fine-tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR) video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data, VidChatBench, to serve as a benchmark for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
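The two-dimensional organization described in the abstract (four interaction-type branches crossed with two quality tiers) lends itself to a simple clip-level metadata schema. The sketch below is a minimal, hypothetical illustration of how such an index might be modeled and filtered; the class and field names are our own assumptions and do not come from the SpeakerVid-5M release.

```python
from dataclasses import dataclass
from enum import Enum

class Branch(Enum):
    # The four interaction-type branches named in the abstract.
    DIALOGUE = "dialogue"
    SINGLE = "single"        # monadic talking
    LISTENING = "listening"
    MULTI_TURN = "multi_turn"

class Tier(Enum):
    # The two quality strata: large-scale pre-training vs. curated SFT.
    PRETRAIN = "pretrain"
    SFT = "sft"

@dataclass
class ClipMeta:
    """Hypothetical per-clip metadata record (field names are illustrative)."""
    clip_id: str
    branch: Branch
    tier: Tier
    duration_s: float
    video_path: str
    audio_path: str

def select(clips, branch=None, tier=None):
    """Filter a clip index along the dataset's two structural dimensions."""
    return [
        c for c in clips
        if (branch is None or c.branch == branch)
        and (tier is None or c.tier == tier)
    ]

# Example: gather the curated dialogue clips for supervised fine-tuning.
# sft_dialogue = select(all_clips, branch=Branch.DIALOGUE, tier=Tier.SFT)
```

Keeping branch and tier as orthogonal fields, rather than baking them into directory names, is one natural way to let the same index serve both large-scale pre-training and SFT-style curation.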