

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

March 3, 2026
Authors: Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Hao Liao, Xing Xie, Tao Jin, Jianxun Lian
cs.AI

Abstract

Proactive, real-time interactive experiences are essential for human-like AI companions, yet they face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both the quality and quantity of generated content under real-time constraints. In this work, we instantiate AI companions through two gaming roles, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset covering three representative scenarios (solo commentary, co-commentary, and user guidance), and present Proact-VL, a general framework that shapes multimodal large language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show that Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.
PDF · March 6, 2026