

AURA: Always-On Understanding and Real-Time Assistance via Video Streams

April 5, 2026
Authors: Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu, Hongsheng Li
cs.AI

Abstract

Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
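To make the "always-on" interaction pattern concrete, the loop below is a minimal, hypothetical sketch of the workflow the abstract describes: frames arrive continuously, a bounded context window manages long-horizon memory, and the system can respond both reactively (to a user question) and proactively (when it decides an event warrants speaking). All names here (`StreamingAssistant`, `should_respond`, the mock frame format) are illustrative placeholders, not AURA's actual API; in the real system the unified VideoLLM itself performs the trigger and response steps.

```python
import collections

class StreamingAssistant:
    """Toy always-on streaming loop (illustrative, not AURA's implementation)."""

    def __init__(self, context_frames=8):
        # Sliding-window context: keep only the most recent frames so the
        # prompt does not grow without bound over a long-horizon stream.
        self.context = collections.deque(maxlen=context_frames)

    def ingest(self, frame):
        self.context.append(frame)

    def should_respond(self, frame):
        # Placeholder trigger: a real end-to-end system would let the model
        # decide when to speak; here we fire on frames tagged as events.
        return frame.get("event", False)

    def answer(self, question):
        # Placeholder for VideoLLM decoding over the current context window.
        return f"answer to {question!r} using {len(self.context)} frames"

# Mock feed standing in for a 2 FPS video stream; every fifth frame is an event.
stream = [{"t": t, "event": t % 5 == 4} for t in range(12)]
assistant = StreamingAssistant()
proactive = []
for frame in stream:
    assistant.ingest(frame)
    if assistant.should_respond(frame):        # proactive response path
        proactive.append(frame["t"])

print(proactive)                               # → [4, 9]
print(assistant.answer("what happened?"))      # reactive QA path
```

The sliding-window buffer is the simplest possible stand-in for the paper's context management; the actual framework additionally relies on dedicated training objectives and deployment optimizations to keep such long-running interaction stable.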
PDF · April 8, 2026