Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks

Abstract

Object orientation understanding is a fundamental challenge in visual perception and is critical for applications such as robotic manipulation and augmented reality. Current vision-language benchmarks fail to isolate this capability, often conflating it with positional relationships and general scene understanding. We introduce DORI (Discriminative Orientation Reasoning Intelligence), a comprehensive benchmark that establishes object orientation perception as a primary evaluation target. DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. Through carefully curated tasks drawn from 11 datasets and spanning 67 object categories across synthetic and real-world scenarios, DORI provides insight into how multimodal systems understand object orientation. Our evaluation of 15 state-of-the-art vision-language models reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, and performance deteriorates further on tasks that require reference frame shifts or compound rotations. These findings demonstrate the need for dedicated orientation representation mechanisms: models show a systematic inability to perform precise angular estimation, track orientation changes across viewpoints, and understand compound rotations, which suggests limitations in their internal 3D spatial representations. As the first diagnostic framework specifically designed for orientation awareness in multimodal systems, DORI offers implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments. DORI data: https://huggingface.co/datasets/appledora/DORI-Benchmark
