Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
June 5, 2025
Authors: Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, Ranjay Krishna
cs.AI
Abstract
Spatial cognition is essential for human intelligence, enabling
problem-solving through visual simulations rather than solely relying on verbal
reasoning. However, existing AI benchmarks primarily assess verbal reasoning,
neglecting the complexities of non-verbal, multi-step visual simulation. We
introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark
designed to rigorously evaluate multimodal large language models on tasks
better solved through multi-step visual simulation. STARE features 4K tasks
spanning foundational geometric transformations (2D and 3D), integrated spatial
reasoning (cube net folding and tangram puzzles), and real-world spatial
reasoning (perspective and temporal reasoning), reflecting practical cognitive
challenges like object assembly, mechanical diagram interpretation, and
everyday spatial navigation. Our evaluations show that models excel at
reasoning over simpler 2D transformations, but perform close to random chance
on more complex tasks like 3D cube net folding and tangram puzzles that require
multi-step visual simulations. Humans achieve near-perfect accuracy on these
complex tasks but take considerable time (up to 28.9 s), and they speed up
significantly (by 7.5 s on average) when given intermediate visual simulations. In
contrast, models exhibit inconsistent performance gains from visual
simulations, improving on most tasks but declining in specific cases like
tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0
Flash), indicating that models may not know how to effectively leverage
intermediate visual information.
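
The abstract contrasts accuracy with and without intermediate visual simulations. The sketch below is a minimal illustration of that comparison protocol; the Task layout and the query_model call are hypothetical placeholders, not the paper's actual evaluation harness.

    # Minimal sketch of a with/without-visual-simulation comparison.
    # `query_model` stands in for any multimodal model API (hypothetical).
    from dataclasses import dataclass

    @dataclass
    class Task:
        question: str
        images: list[str]      # base input image(s) for the task
        sim_frames: list[str]  # intermediate visual-simulation frames
        answer: str            # gold answer label, e.g. "A"

    def query_model(prompt: str, images: list[str]) -> str:
        """Placeholder for a multimodal model call (e.g. GPT-4o, Claude-3.5)."""
        raise NotImplementedError

    def accuracy(tasks: list[Task], with_sim: bool) -> float:
        # Append simulation frames to the input only in the with-sim condition,
        # so the two runs differ solely in the intermediate visual evidence.
        correct = 0
        for t in tasks:
            images = t.images + (t.sim_frames if with_sim else [])
            pred = query_model(t.question, images)
            correct += int(pred.strip().upper() == t.answer)
        return correct / len(tasks)

    # gain = accuracy(tasks, with_sim=True) - accuracy(tasks, with_sim=False)
    # A negative gain corresponds to the declines the abstract reports for
    # tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash).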