
From Perception to Action: An Interactive Benchmark for Vision Reasoning

February 24, 2026
Authors: Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee
cs.AI

Abstract
Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess an agent's ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles, 3D stacking, and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.