MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Abstract

Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they rely overwhelmingly on static images and so fail to capture the temporal complexity of real-world environments. Second, they focus narrowly on mathematical problem-solving, neglecting the broader spectrum of reasoning skills (abstract, physical, planning, spatial, and temporal) required for robust multimodal intelligence. Third, many benchmarks saturate quickly, leaving limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics, so difficulty can be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it well suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems, including Gemini 2.5 Pro and OpenAI o3 (the strongest available at the time) alongside strong open-source models, reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
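To make the idea of script-driven, difficulty-controlled generation concrete, the following is a minimal sketch, not the released MORSE-500 scripts. It assumes a hypothetical `Difficulty` record and a hypothetical `generate_instance` function that build a scene description (object tracks plus a question and its ground-truth answer) from a seed. Rendering the tracks to video, which the benchmark does with tools such as Manim, Matplotlib, and MoviePy, is deliberately left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    """Hypothetical difficulty knobs; names are illustrative assumptions."""
    num_targets: int      # objects the question asks about
    num_distractors: int  # irrelevant objects added to the scene
    num_frames: int       # clip length, a rough proxy for temporal dynamics


def generate_instance(seed: int, difficulty: Difficulty) -> dict:
    """Build a deterministic scene description with a known answer."""
    # A private RNG seeded per instance: the same seed and difficulty
    # always produce the same scene.
    rng = random.Random(seed)

    objects = []
    for kind, count in (("target", difficulty.num_targets),
                        ("distractor", difficulty.num_distractors)):
        for _ in range(count):
            objects.append({
                "kind": kind,
                # Targets are always red; distractors never are.
                "color": "red" if kind == "target"
                         else rng.choice(["blue", "green", "gray"]),
                "start": (rng.random(), rng.random()),
                "velocity": (rng.uniform(-0.02, 0.02),
                             rng.uniform(-0.02, 0.02)),
            })
    rng.shuffle(objects)  # avoid targets always coming first

    # Linear motion: one list of (x, y) positions per frame.
    frames = [
        [(o["start"][0] + t * o["velocity"][0],
          o["start"][1] + t * o["velocity"][1]) for o in objects]
        for t in range(difficulty.num_frames)
    ]

    return {
        "objects": objects,
        "frames": frames,
        "question": "How many red objects appear in the clip?",
        # The answer is known by construction, so no manual labeling is needed.
        "answer": difficulty.num_targets,
    }


easy = Difficulty(num_targets=2, num_distractors=1, num_frames=30)
hard = Difficulty(num_targets=5, num_distractors=40, num_frames=300)
easy_clip = generate_instance(seed=7, difficulty=easy)
hard_clip = generate_instance(seed=7, difficulty=hard)
```

In this sketch, raising `num_distractors` or `num_frames` yields a harder instance of the same task while the ground-truth answer stays exact, which is the property that lets a scripted benchmark scale difficulty as models improve.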