Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks

Abstract

Understanding object orientation is a fundamental challenge in visual perception and is critical for applications such as robotic manipulation and augmented reality. Current vision-language benchmarks fail to isolate this capability, often conflating it with positional relationships and general scene understanding. We introduce DORI (Discriminative Orientation Reasoning Intelligence), a comprehensive benchmark that establishes object orientation perception as a primary evaluation target. DORI assesses four dimensions of orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. Through carefully curated tasks drawn from 11 datasets spanning 67 object categories across synthetic and real-world scenarios, DORI provides insight into how multimodal systems understand object orientations. Our evaluation of 15 state-of-the-art vision-language models reveals critical limitations: even the best models achieve only 54.2% accuracy on coarse tasks and 33.0% on granular orientation judgments, with performance deteriorating on tasks that require reference frame shifts or compound rotations. These findings demonstrate the need for dedicated orientation representation mechanisms: models are systematically unable to perform precise angular estimations, track orientation changes across viewpoints, or understand compound rotations, which suggests limitations in their internal 3D spatial representations. As the first diagnostic framework specifically designed for orientation awareness in multimodal systems, DORI offers implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments. DORI data: https://huggingface.co/datasets/appledora/DORI-Benchmark
