Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

Abstract

Spatial cognition is essential to human intelligence: it enables problem-solving through visual simulation rather than through verbal reasoning alone. However, existing AI benchmarks primarily assess verbal reasoning and neglect the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges such as object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations but perform close to random chance on more complex tasks, such as 3D cube net folding and tangram puzzles, that require multi-step visual simulation. Humans achieve near-perfect accuracy but take considerable time (up to 28.9 s) on complex tasks, and they answer significantly faster (by 7.5 s on average) when given intermediate visual simulations. In contrast, models show inconsistent gains from visual simulations: performance improves on most tasks but declines in specific cases such as tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.
