From Perception to Action: An Interactive Benchmark for Vision Reasoning

Abstract

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess an agent's ability to reason about how geometry, contact, and support relations jointly constrain which actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, physics-driven 3D testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
