AURA: Always-On Understanding and Real-Time Assistance via Video Streams

Abstract

Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems operate offline and are poorly suited to live video streams, which require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, which reduces their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with automatic speech recognition (ASR) and text-to-speech (TTS) running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
