MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Abstract

Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they focus narrowly on mathematical problem-solving, neglecting the broader spectrum of reasoning skills (abstract, physical, planning, spatial, and temporal) required for robust multimodal intelligence. Third, many benchmarks saturate quickly, leaving limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics, so that difficulty can be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it well suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems, including Gemini 2.5 Pro and OpenAI o3 (the strongest models available at the time) alongside strong open-source models, reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
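To make the idea of script-driven difficulty control concrete, the following is a minimal sketch, not the MORSE-500 pipeline itself. It assumes a hypothetical `SceneSpec` with a seed, a distractor count, and a frame count, and a hypothetical `generate_instance` function that derives object tracks and a ground-truth answer from the spec alone. All names, fields, and the counting question are illustrative assumptions; rendering the tracks to video (for example with Matplotlib or Manim) is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneSpec:
    """Hypothetical description of one clip; not the MORSE-500 schema."""
    seed: int
    n_targets: int       # objects the question asks about
    n_distractors: int   # irrelevant objects added to raise difficulty
    n_frames: int        # clip length, a stand-in for temporal dynamics


def generate_instance(spec: SceneSpec) -> dict:
    """Build object tracks and a ground-truth answer from the spec alone."""
    rng = random.Random(spec.seed)  # local RNG: output depends only on spec
    kinds = ["target"] * spec.n_targets + ["distractor"] * spec.n_distractors
    rng.shuffle(kinds)
    objects = []
    for kind in kinds:
        # Random start position and a small constant velocity per object.
        x, y = rng.random(), rng.random()
        vx, vy = rng.uniform(-0.01, 0.01), rng.uniform(-0.01, 0.01)
        track = [(x + vx * t, y + vy * t) for t in range(spec.n_frames)]
        objects.append({"kind": kind, "track": track})
    return {
        "objects": objects,
        "question": "How many target objects appear in the clip?",
        "answer": spec.n_targets,
    }


# Same question type at two difficulty settings: only the knobs change.
easy = generate_instance(SceneSpec(seed=0, n_targets=3, n_distractors=2, n_frames=30))
hard = generate_instance(SceneSpec(seed=0, n_targets=3, n_distractors=40, n_frames=120))
```

Because the answer is computed from the same spec that drives generation, raising `n_distractors` or `n_frames` in this sketch yields a harder instance without any manual relabeling, which is the property the abstract attributes to a controllable generation pipeline.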