Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Abstract

Proactive, real-time interaction is essential for human-like AI companions, yet it faces three key challenges: (1) achieving low-latency inference under continuous streaming input, (2) autonomously deciding when to respond, and (3) controlling both the quality and the quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming roles, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset covering three representative scenarios: solo commentary, co-commentary, and user guidance. We also present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show that Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.
