Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Abstract

Proactive, real-time interaction is essential for human-like AI companions, yet it poses three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both the quality and the quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset covering three representative scenarios (solo commentary, co-commentary, and user guidance), and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show that Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.
