VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.
