MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Abstract

Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they focus narrowly on mathematical problem-solving, neglecting the broader spectrum of reasoning skills (abstract, physical, planning, spatial, and temporal) required for robust multimodal intelligence. Third, many benchmarks saturate quickly, leaving little headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics, so difficulty can be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it well suited to stress-testing next-generation models. Initial experiments with state-of-the-art systems, including Gemini 2.5 Pro and OpenAI o3 (the strongest models available at the time) alongside strong open-source models, reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
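To make the idea of script-driven, difficulty-controllable generation concrete, the following is a minimal sketch and not the MORSE-500 pipeline itself. It assumes a hypothetical task (counting how often a target ball hits a wall) and hypothetical knobs (`n_distractors`, `n_frames`, `speed`); none of these names come from the paper. It only produces a deterministic scene specification with a ground-truth answer; rendering the frames to video (for example with Manim, Matplotlib, or MoviePy) is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    """Hypothetical difficulty knobs; names are illustrative, not from MORSE-500."""
    n_distractors: int = 2   # extra moving objects that never count toward the answer
    n_frames: int = 60       # clip length in frames
    speed: float = 0.07      # target displacement per frame (box width = 1)


def bounce_positions(x0, v, n_frames):
    """Simulate a 1-D point bouncing inside [0, 1].

    Returns the per-frame positions and the number of wall hits.
    Assumes |v| < 1 so a single reflection per frame is enough.
    """
    x, bounces, xs = x0, 0, []
    for _ in range(n_frames):
        x += v
        if x < 0.0 or x > 1.0:
            # Reflect off the wall and reverse direction.
            x = -x if x < 0.0 else 2.0 - x
            v = -v
            bounces += 1
        xs.append(x)
    return xs, bounces


def make_instance(seed, difficulty):
    """Build one question/answer instance; the same seed gives the same instance."""
    rng = random.Random(seed)  # local RNG keeps generation deterministic
    target, answer = bounce_positions(
        rng.random(), difficulty.speed, difficulty.n_frames
    )
    distractors = []
    for _ in range(difficulty.n_distractors):
        xs, _ = bounce_positions(
            rng.random(), rng.uniform(-0.1, 0.1), difficulty.n_frames
        )
        distractors.append(xs)
    return {
        "question": "How many times does the red ball hit a wall?",
        "answer": answer,            # ground truth computed by the script
        "target": target,            # trajectory a renderer would draw in red
        "distractors": distractors,  # trajectories drawn in other colors
    }


# Same task, two difficulty settings: only the knobs change.
easy = make_instance(seed=0, difficulty=Difficulty())
hard = make_instance(seed=0, difficulty=Difficulty(n_distractors=8, n_frames=120))
```

Because the answer is computed by the same script that defines the scene, raising `n_distractors` or `n_frames` in this sketch yields harder instances without any manual relabeling, which is the property the abstract attributes to script-driven generation.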
