Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks

Abstract

Object orientation understanding is a fundamental challenge in visual perception, critical for applications such as robotic manipulation and augmented reality. Current vision-language benchmarks fail to isolate this capability, often conflating it with positional relationships and general scene understanding. We introduce DORI (Discriminative Orientation Reasoning Intelligence), a comprehensive benchmark establishing object orientation perception as a primary evaluation target. DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. Through carefully curated tasks from 11 datasets spanning 67 object categories across synthetic and real-world scenarios, DORI provides insights into how multimodal systems understand object orientations. Our evaluation of 15 state-of-the-art vision-language models reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, with performance deteriorating for tasks requiring reference frame shifts or compound rotations. These findings demonstrate the need for dedicated orientation representation mechanisms, as models show a systematic inability to perform precise angular estimations, track orientation changes across viewpoints, and understand compound rotations, suggesting limitations in their internal 3D spatial representations. As the first diagnostic framework specifically designed for orientation awareness in multimodal systems, DORI offers implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments. DORI data: https://huggingface.co/datasets/appledora/DORI-Benchmark
