Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents
September 17, 2025
Authors: Xueqiao Zhang, Chao Zhang, Jingtao Xu, Yifan Zhu, Xin Shi, Yi Yang, Yawei Luo
cs.AI
Abstract
Role-playing agents (RPAs) have attracted growing interest for their ability
to simulate immersive and interactive characters. However, existing approaches
primarily focus on static role profiles, overlooking the dynamic perceptual
abilities inherent to humans. To bridge this gap, we introduce the concept of
dynamic role profiles by incorporating video modality into RPAs. To support
this, we construct Role-playing-Video60k, a large-scale, high-quality dataset
comprising 60k videos and 700k corresponding dialogues. Based on this dataset,
we develop a comprehensive RPA framework that combines adaptive temporal
sampling with both dynamic and static role profile representations.
Specifically, the dynamic profile is created by adaptively sampling video
frames and feeding them to the LLM in temporal order, while the static profile
consists of (1) character dialogues from training videos during fine-tuning,
and (2) a summary context from the input video during inference. This joint
integration enables RPAs to generate richer responses. Furthermore, we propose
a robust evaluation method covering eight metrics. Experimental results
demonstrate the effectiveness of our framework, highlighting the importance of
dynamic role profiles in developing RPAs.
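
To make the two profile types concrete, below is a minimal Python sketch, not the authors' implementation: the function names `adaptive_sample_frames` and `build_prompt`, the `max_frames` budget, and the bracketed prompt tags are all illustrative assumptions. It shows how frames could be adaptively subsampled while preserving temporal order (the dynamic profile) and combined with a video summary and character dialogues (the static profile) into a single LLM input.

```python
import math

def adaptive_sample_frames(num_frames: int, max_frames: int = 32) -> list[int]:
    """Adaptive temporal sampling (illustrative): short clips keep every
    frame; long clips are subsampled to a fixed budget, with indices
    returned in temporal order."""
    if num_frames <= max_frames:
        return list(range(num_frames))
    stride = num_frames / max_frames  # spacing between sampled frames
    return [min(num_frames - 1, math.floor(i * stride)) for i in range(max_frames)]

def build_prompt(frame_captions: list[str], summary: str,
                 dialogues: list[str], user_turn: str) -> str:
    """Assemble the RPA input: the dynamic profile (frame descriptions in
    temporal order) plus the static profile (summary context and
    in-character dialogues). The bracketed tags are hypothetical."""
    return "\n".join([
        "[DYNAMIC PROFILE] " + " | ".join(frame_captions),
        "[STATIC PROFILE: SUMMARY] " + summary,
        "[STATIC PROFILE: DIALOGUES] " + "\n".join(dialogues),
        "[USER] " + user_turn,
    ])

# Example: a 900-frame clip is reduced to 32 evenly spaced, ordered frames.
indices = adaptive_sample_frames(900)
assert indices == sorted(indices) and len(indices) == 32
```

The key design point this sketch reflects is that sampling density adapts to clip length while the sampled frames are always fed to the LLM in their original temporal order, so the dynamic profile preserves the character's behavior over time rather than an unordered bag of frames.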