AURA: Always-On Understanding and Real-Time Assistance via Video Streams

Abstract

Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well suited to live video streams, which require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, which reduces their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
