AURA: Always-On Understanding and Real-Time Assistance via Video Streams
April 5, 2026
Authors: Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu, Hongsheng Li
cs.AI
Abstract
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with automatic speech recognition (ASR) and text-to-speech (TTS) running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
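To make the always-on interaction pattern the abstract describes concrete, the sketch below shows one way a streaming loop could combine fixed-rate frame ingestion with both reactive answers and proactive responses. It is a minimal illustration under assumptions, not AURA's implementation: `StreamingAssistant`, `encode_frame`, `answer`, `respond`, `should_respond`, and the `user_queries.poll()` interface are all hypothetical names introduced here.

```python
import time
from collections import deque


class StreamingAssistant:
    """Hypothetical always-on loop: ingest frames at a fixed rate,
    answer user questions on demand, and let the model emit a
    proactive response when it judges the scene warrants one."""

    def __init__(self, model, fps=2.0, context_limit=512):
        self.model = model                   # assumed VideoLLM wrapper (hypothetical API)
        self.frame_interval = 1.0 / fps      # 2 FPS, matching the demo system's rate
        self.context = deque(maxlen=context_limit)  # bounded visual context window

    def run(self, stream, user_queries):
        """stream: iterator yielding frames; user_queries: non-blocking
        queue of user questions (both interfaces assumed here)."""
        next_tick = time.monotonic()
        for frame in stream:
            # Continuously fold new frames into the managed context.
            self.context.append(self.model.encode_frame(frame))

            # Reactive path: answer a pending user question, if any.
            query = user_queries.poll()
            if query is not None:
                yield self.model.answer(list(self.context), query)

            # Proactive path: the model decides to speak unprompted.
            elif self.model.should_respond(list(self.context)):
                yield self.model.respond(list(self.context))

            # Pace ingestion to the target frame rate.
            next_tick += self.frame_interval
            time.sleep(max(0.0, next_tick - time.monotonic()))
```

In a full demo pipeline, `user_queries` would be fed by an ASR front end and the yielded text handed to a TTS back end; the bounded `deque` stands in for whatever context-management policy keeps long-horizon streams within the model's window.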