VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Abstract

Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, because it underpins real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and advance MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations demonstrate that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.
