From Perception to Action: An Interactive Benchmark for Vision Reasoning

Abstract

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess an agent's ability to reason about how geometry, contact, and support relations jointly constrain which actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, physics-driven 3D testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that even top-performing models struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.

