VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
September 23, 2025
Authors: Hao Wang, Eiki Murata, Lingfang Zhang, Ayako Sato, So Fukuda, Ziqi Yin, Wentao Hu, Keisuke Nakao, Yusuke Nakamura, Sebastian Zwirner, Yi-Chia Chen, Hiroyuki Otomo, Hiroki Ouchi, Daisuke Kawahara
cs.AI
Abstract
Recent advances in multimodal large language models (MLLMs) have
significantly enhanced video understanding capabilities, opening new
possibilities for practical applications. Yet current video benchmarks focus
mainly on indoor scenes or short-range outdoor activities, leaving the
challenges of long-distance travel largely unexplored. Mastering
extended geospatial-temporal trajectories is critical for next-generation
MLLMs, underpinning real-world tasks such as embodied-AI planning and
navigation. To bridge this gap, we present VIR-Bench, a novel benchmark
consisting of 200 travel videos that frames itinerary reconstruction as a
challenging task designed to evaluate and push forward MLLMs'
geospatial-temporal intelligence. Experimental results reveal that
state-of-the-art MLLMs, including proprietary ones, struggle to achieve high
scores, underscoring the difficulty of handling videos that span extended
spatial and temporal scales. Moreover, we conduct an in-depth case study in
which we develop a prototype travel-planning agent that leverages the insights
gained from VIR-Bench. The agent's markedly improved itinerary recommendations
verify that our evaluation protocol not only benchmarks models effectively but
also translates into concrete performance gains in user-facing applications.