

From Perception to Action: An Interactive Benchmark for Vision Reasoning

February 24, 2026
作者: Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee
cs.AI

Abstract

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles, 3D stacking, and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
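To make the contrast with single-turn VQA concrete, the multi-turn "perceive, plan, act" loop described above can be sketched as follows. The abstract does not expose CHAIN's actual API, so every name here (`StackingEnv`, `run_episode`, `greedy_policy`, the `place_<width>` action format) is purely illustrative, with a toy stability rule standing in for the physics engine.

```python
# Hypothetical sketch of an interactive, constraint-aware evaluation loop.
# All class/function names and the stability rule are illustrative assumptions,
# not CHAIN's real interface.
from dataclasses import dataclass, field

@dataclass
class StackingEnv:
    """Toy stand-in for a physics-driven 3D stacking task."""
    goal_height: int = 3
    stack: list = field(default_factory=list)

    def observe(self) -> dict:
        # The agent perceives the current structure, not just a static image.
        return {"stack": list(self.stack), "goal_height": self.goal_height}

    def step(self, action: str) -> bool:
        # Toy support constraint: a block is stable only if it is strictly
        # narrower than the block it rests on.
        width = int(action.split("_")[1])
        if self.stack and width >= self.stack[-1]:
            return False  # violates the support relation; action rejected
        self.stack.append(width)
        return True

    @property
    def solved(self) -> bool:
        return len(self.stack) >= self.goal_height

def run_episode(env: StackingEnv, policy, max_turns: int = 10) -> bool:
    """Multi-turn loop: perceive the state, choose one action, act, repeat."""
    for _ in range(max_turns):
        action = policy(env.observe())
        env.step(action)
        if env.solved:
            return True
    return False

# Trivial scripted policy standing in for a VLM agent: place ever-narrower blocks.
def greedy_policy(obs: dict) -> str:
    return f"place_{9 - len(obs['stack'])}"

print(run_episode(StackingEnv(), greedy_policy))  # True
```

The point of the sketch is that success depends on a sequence of actions each satisfying a physical constraint, which is exactly what a single-turn question-answering setup cannot measure.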