MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Abstract

Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills (including abstract, physical, planning, spatial, and temporal capabilities) required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, and MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics, so difficulty can be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it well suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems, including Gemini 2.5 Pro and OpenAI o3 (the strongest available at the time) alongside strong open-source models, reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
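
To make the idea of script-driven, scalable difficulty concrete, the following is a minimal sketch of how a deterministic generator might expose difficulty knobs. It is not the MORSE-500 pipeline: the `Difficulty` fields, the bouncing-dot task, and the function names are all hypothetical, and rendering (for example with Manim or Matplotlib) is deliberately left out. The sketch only shows that a seeded script can produce a clip specification together with its ground-truth answer, and that raising the knobs yields a harder instance of the same task.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    """Hypothetical difficulty knobs; names are illustrative, not from MORSE-500."""
    n_distractors: int  # distractor density
    n_frames: int       # clip length (temporal extent)
    max_speed: float    # maximum displacement per frame, as a fraction of the scene width


def simulate_bounces(x: float, v: float, n_frames: int) -> tuple[list[float], int]:
    """Move a point in [0, 1], reflecting at the walls; return the track and bounce count."""
    track, bounces = [], 0
    for _ in range(n_frames):
        x += v
        if x < 0.0 or x > 1.0:
            # Reflect the overshoot back into the unit interval and reverse direction.
            x = -x if x < 0.0 else 2.0 - x
            v = -v
            bounces += 1
        track.append(x)
    return track, bounces


def make_instance(seed: int, difficulty: Difficulty) -> dict:
    """Build one clip specification; the same seed and knobs always give the same clip."""
    if not 0.0 < difficulty.max_speed < 1.0:
        raise ValueError("max_speed must be in (0, 1) so one reflection per frame suffices")
    rng = random.Random(seed)  # local RNG, independent of global random state

    def random_track() -> tuple[list[float], int]:
        x = rng.random()
        v = rng.uniform(-difficulty.max_speed, difficulty.max_speed)
        return simulate_bounces(x, v, difficulty.n_frames)

    target_track, answer = random_track()
    distractors = [random_track()[0] for _ in range(difficulty.n_distractors)]
    return {
        "question": "How many times does the red dot hit a wall?",
        "answer": answer,            # ground truth comes from the simulation itself
        "target": target_track,      # per-frame positions of the queried object
        "distractors": distractors,  # per-frame positions of irrelevant objects
    }


if __name__ == "__main__":
    easy = Difficulty(n_distractors=2, n_frames=30, max_speed=0.05)
    hard = Difficulty(n_distractors=20, n_frames=120, max_speed=0.2)
    for name, level in (("easy", easy), ("hard", hard)):
        spec = make_instance(seed=7, difficulty=level)
        print(name, len(spec["distractors"]), "distractors, answer =", spec["answer"])
```

Because the answer is computed by the same script that defines the scene, a harder variant needs no new annotation: only the knob values change.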
