VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Abstract

Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To address both, we introduce VideoAtlas, a task-agnostic environment that represents video as a hierarchical grid that is simultaneously lossless, navigable, scalable, and free of captioning and preprocessing. An overview of the video is available at a glance, and any region can be recursively zoomed into. The same visual representation is used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures that access depth grows only logarithmically with video length. For long context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. Formulating VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture in which a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse; (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; and (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
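
The logarithmic access-depth claim can be illustrated with a small sketch. It is not the authors' implementation: the function names `zoom_depth` and `zoom`, the grid size of 64 cells, and the 1-second leaf span are all hypothetical choices made only for this example. It assumes each zoom step splits the current time interval into equally sized cells.

```python
def zoom_depth(duration_s: float, cells_per_grid: int = 64,
               leaf_span_s: float = 1.0) -> int:
    """Count zoom steps until one grid cell spans at most leaf_span_s seconds.

    cells_per_grid and leaf_span_s are illustrative assumptions,
    not values taken from the paper.
    """
    if duration_s <= 0 or leaf_span_s <= 0 or cells_per_grid < 2:
        raise ValueError("duration and leaf span must be positive; grid needs >= 2 cells")
    depth = 0
    span = duration_s
    while span > leaf_span_s:
        span /= cells_per_grid  # each zoom narrows the interval by the grid size
        depth += 1
    return depth


def zoom(start_s: float, end_s: float, cell: int,
         cells_per_grid: int = 64) -> tuple[float, float]:
    """Return the sub-interval covered by one cell of the grid over [start_s, end_s]."""
    if not 0 <= cell < cells_per_grid:
        raise IndexError("cell index outside the grid")
    width = (end_s - start_s) / cells_per_grid
    return start_s + cell * width, start_s + (cell + 1) * width


if __name__ == "__main__":
    for hours in (1, 10):
        print(hours, "h ->", zoom_depth(hours * 3600.0), "zoom steps")
```

Under these assumptions, a 1-hour video needs 2 zoom steps and a 10-hour video needs 3, so a tenfold increase in duration adds a single level. This is the behavior the abstract describes as depth growing logarithmically with video length.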

