

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

March 3, 2026
Authors: Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Hao Liao, Xing Xie, Tao Jin, Jianxun Lian
cs.AI

Abstract

Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.
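The second challenge above, autonomously deciding *when* to respond over a continuous stream, can be illustrated with a minimal toy sketch. This is not the paper's method: the `Frame`, `salience`, `threshold`, and `min_gap` names are all hypothetical stand-ins for whatever signal a proactive VideoLLM uses to trigger a response, and the gap constraint is a crude proxy for the latency/quantity control the abstract describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    timestamp: float  # seconds into the stream
    salience: float   # hypothetical per-frame "worth responding" score in [0, 1]


def proactive_loop(frames: List[Frame], threshold: float = 0.7,
                   min_gap: float = 2.0) -> List[float]:
    """Emit a response timestamp when salience crosses `threshold`,
    but never within `min_gap` seconds of the previous response."""
    responses: List[float] = []
    last = float("-inf")
    for f in frames:
        if f.salience >= threshold and f.timestamp - last >= min_gap:
            responses.append(f.timestamp)
            last = f.timestamp
    return responses


# Two salience spikes close together (t=1, t=2) collapse into one response.
stream = [Frame(0.0, 0.1), Frame(1.0, 0.9), Frame(2.0, 0.8), Frame(4.0, 0.95)]
print(proactive_loop(stream))  # [1.0, 4.0]
```

In a real system the threshold test would be replaced by the model's own learned response-timing decision, but the shape of the loop, consume frames, decide, optionally speak, is the same.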
PDF (243) · March 6, 2026