Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents

Abstract

Role-playing agents (RPAs) have attracted growing interest for their ability to simulate immersive and interactive characters. However, existing approaches primarily focus on static role profiles, overlooking the dynamic perceptual abilities inherent to humans. To bridge this gap, we introduce the concept of dynamic role profiles by incorporating the video modality into RPAs. To support this, we construct Role-playing-Video60k, a large-scale, high-quality dataset comprising 60k videos and 700k corresponding dialogues. Based on this dataset, we develop a comprehensive RPA framework that combines adaptive temporal sampling with both dynamic and static role profile representations. Specifically, the dynamic profile is created by adaptively sampling video frames and feeding them to the LLM in temporal order, while the static profile consists of (1) character dialogues from training videos during fine-tuning, and (2) a summary context from the input video during inference. This joint integration enables RPAs to generate better responses. Furthermore, we propose a robust evaluation method covering eight metrics. Experimental results demonstrate the effectiveness of our framework, highlighting the importance of dynamic role profiles in developing RPAs.
