Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
June 5, 2025
Authors: Linjie Li, Mahtab Bigverdi, Jiawei Gu, Zixian Ma, Yinuo Yang, Ziang Li, Yejin Choi, Ranjay Krishna
cs.AI
Abstract
Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks, such as 3D cube net folding and tangram puzzles, that require multi-step visual simulations. Humans achieve near-perfect accuracy but take considerable time (up to 28.9 seconds) on complex tasks, and speed up significantly (by 7.5 seconds on average) with intermediate visual simulations. In contrast, models show inconsistent gains from visual simulations: they improve on most tasks but decline in specific cases, such as tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.
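The abstract contrasts model accuracy with and without intermediate visual simulations. As a rough illustration of what such a comparison could look like in practice, below is a minimal, hypothetical Python sketch; the task schema (stare_tasks.json, simulation_frames, answer_index) and the query_model stub are assumptions for illustration only, not the authors' released evaluation code.

```python
# Hypothetical sketch: comparing accuracy with and without intermediate
# visual-simulation frames on STARE-style tasks. File names, fields, and
# query_model are illustrative placeholders, not the paper's actual code.
import json


def query_model(images, question, choices):
    """Stand-in for a multimodal LLM call (e.g., an API client).

    Expected to return the index of the chosen answer. Replace with a real
    model call; here it trivially picks the first choice so the script runs.
    """
    return 0


def evaluate(tasks, use_simulation_frames):
    """Return accuracy over tasks, optionally including intermediate frames."""
    correct = 0
    for task in tasks:
        images = [task["input_image"]]
        if use_simulation_frames:
            # Append step-by-step visual-simulation frames (e.g., partial
            # folds of a cube net) as extra image inputs to the model.
            images += task.get("simulation_frames", [])
        prediction = query_model(images, task["question"], task["choices"])
        correct += int(prediction == task["answer_index"])
    return correct / len(tasks)


if __name__ == "__main__":
    # Assumed task schema: input_image, simulation_frames, question,
    # choices, answer_index. Adjust to the released benchmark format.
    with open("stare_tasks.json") as f:
        tasks = json.load(f)

    without_sim = evaluate(tasks, use_simulation_frames=False)
    with_sim = evaluate(tasks, use_simulation_frames=True)
    print(f"accuracy without simulations: {without_sim:.3f}")
    print(f"accuracy with simulations:    {with_sim:.3f}")
```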