

AURA: Always-On Understanding and Real-Time Assistance via Video Streams

April 5, 2026
Authors: Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu, Hongsheng Li
cs.AI

Abstract

Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
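To make the "always-on" interaction pattern concrete, the loop below is a minimal, hypothetical sketch of the workflow the abstract describes: frames arrive continuously, a bounded context window manages long-horizon memory, and the system can respond both reactively (to a user question) and proactively (when it decides an event warrants speaking). All names here (`StreamingAssistant`, `should_respond`, the mock frame format) are illustrative placeholders, not AURA's actual API; in the real system the unified VideoLLM itself performs the trigger and response steps.

```python
import collections

class StreamingAssistant:
    """Toy always-on streaming loop (illustrative, not AURA's implementation)."""

    def __init__(self, context_frames=8):
        # Sliding-window context: keep only the most recent frames so the
        # prompt does not grow without bound over a long-horizon stream.
        self.context = collections.deque(maxlen=context_frames)

    def ingest(self, frame):
        self.context.append(frame)

    def should_respond(self, frame):
        # Placeholder trigger: a real end-to-end system would let the model
        # decide when to speak; here we fire on frames tagged as events.
        return frame.get("event", False)

    def answer(self, question):
        # Placeholder for VideoLLM decoding over the current context window.
        return f"answer to {question!r} using {len(self.context)} frames"

# Mock feed standing in for a 2 FPS video stream; every fifth frame is an event.
stream = [{"t": t, "event": t % 5 == 4} for t in range(12)]
assistant = StreamingAssistant()
proactive = []
for frame in stream:
    assistant.ingest(frame)
    if assistant.should_respond(frame):        # proactive response path
        proactive.append(frame["t"])

print(proactive)                               # → [4, 9]
print(assistant.answer("what happened?"))      # reactive QA path
```

The sliding-window buffer is the simplest possible stand-in for the paper's context management; the actual framework additionally relies on dedicated training objectives and deployment optimizations to keep such long-running interaction stable.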
PDF · April 8, 2026