From Perception to Action: An Interactive Benchmark for Vision Reasoning

Abstract

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess an agent's ability to reason about how geometry, contact, and support relations jointly constrain which actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, physics-driven 3D testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
