

Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents

September 17, 2025
作者: Xueqiao Zhang, Chao Zhang, Jingtao Xu, Yifan Zhu, Xin Shi, Yi Yang, Yawei Luo
cs.AI

Abstract

Role-playing agents (RPAs) have attracted growing interest for their ability to simulate immersive and interactive characters. However, existing approaches primarily focus on static role profiles, overlooking the dynamic perceptual abilities inherent to humans. To bridge this gap, we introduce the concept of dynamic role profiles by incorporating video modality into RPAs. To support this, we construct Role-playing-Video60k, a large-scale, high-quality dataset comprising 60k videos and 700k corresponding dialogues. Based on this dataset, we develop a comprehensive RPA framework that combines adaptive temporal sampling with both dynamic and static role profile representations. Specifically, the dynamic profile is created by adaptively sampling video frames and feeding them to the LLM in temporal order, while the static profile consists of (1) character dialogues from training videos during fine-tuning, and (2) a summary context from the input video during inference. This joint integration enables RPAs to generate richer responses. Furthermore, we propose a robust evaluation method covering eight metrics. Experimental results demonstrate the effectiveness of our framework, highlighting the importance of dynamic role profiles in developing RPAs.
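The abstract does not specify how the adaptive temporal sampling works; a minimal sketch, assuming a generic motion-based heuristic (allocate more of the frame budget to segments where consecutive frames change most, so the LLM sees the action-dense parts of the video in temporal order), could look like this. The function name and the change metric are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def adaptive_sample(frames: np.ndarray, k: int) -> list[int]:
    """Pick up to k frame indices from an array of frames (n, H, W[, C]).

    Heuristic (an assumption, not the paper's stated rule): mean absolute
    difference between consecutive frames is treated as a "change" score,
    and the cumulative change curve is sampled uniformly, so high-motion
    segments receive more of the frame budget.
    """
    n = len(frames)
    if k >= n:
        return list(range(n))
    # Per-step change score: mean absolute pixel difference.
    diffs = np.abs(frames[1:].astype(float) - frames[:-1].astype(float))
    change = diffs.reshape(n - 1, -1).mean(axis=1)
    # Cumulative change acts as a time-warped axis; sample it uniformly
    # (small epsilon keeps static stretches from collapsing entirely).
    cum = np.concatenate([[0.0], np.cumsum(change + 1e-8)])
    targets = np.linspace(0.0, cum[-1], k)
    idx = sorted({int(np.searchsorted(cum, t)) for t in targets})
    return [min(i, n - 1) for i in idx]
```

With a clip that is static in its first half and changing in its second, most of the returned indices land in the second half; duplicate targets are merged, so the result may contain fewer than `k` indices.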
PDF: September 22, 2025