

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

July 14, 2025
Authors: Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
cs.AI

Abstract

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits, covering diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for supervised fine-tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR) video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data, VidChatBench, to serve as a benchmark for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
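
The two-dimensional organization described in the abstract (interaction branch × quality tier) lends itself to straightforward metadata filtering. Below is a minimal Python sketch of how one might select clips along those two axes. The metadata file name and field names (`branch`, `split`) are hypothetical placeholders for illustration, not the dataset's actual schema.

```python
import json

# Hypothetical metadata layout: one JSON record per clip (JSON Lines),
# with fields marking the interaction branch and the quality tier.
# The real SpeakerVid-5M schema may differ; this only illustrates the
# two-axis structure (interaction type x data quality) from the abstract.
BRANCHES = {"dialogue", "single", "listening", "multi_turn"}
SPLITS = {"pretrain", "sft"}  # large-scale pre-training vs. curated SFT subset

def load_clips(metadata_path, branch=None, split=None):
    """Yield clip records, optionally filtered by branch and quality split."""
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if branch is not None and record.get("branch") != branch:
                continue
            if split is not None and record.get("split") != split:
                continue
            yield record

# Example: collect the curated dyadic-conversation clips for SFT.
# sft_dialogues = list(load_clips("speakervid5m_meta.jsonl",
#                                 branch="dialogue", split="sft"))
```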