VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges of long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, because it underpins real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and advance MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.
