SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Abstract

Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have improved the ability of vision-language models (VLMs) to perceive object locations and positional relationships, these models still cannot precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB drive or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and in the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, reaching, for example, 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
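To make the idea of a semantic orientation concrete, the following is a minimal sketch, not the paper's implementation. It assumes a semantic orientation can be stored as a natural-language phrase paired with a unit direction vector, and that an orientational constraint means rotating the object so that this direction matches a target direction. The names `SemanticOrientation`, `rotation_aligning`, and `apply` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = List[List[float]]


@dataclass(frozen=True)
class SemanticOrientation:
    """Hypothetical container: a language phrase tied to a 3D direction."""
    phrase: str       # e.g. "handle" or "plug-in"
    direction: Vec3   # direction of that part, in whatever frame the caller uses


def normalize(v: Vec3) -> Vec3:
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length direction")
    return (v[0] / n, v[1] / n, v[2] / n)


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotation_aligning(src: Vec3, dst: Vec3) -> Mat3:
    """Rotation matrix R such that R applied to src points along dst."""
    a, b = normalize(src), normalize(dst)
    v = cross(a, b)            # rotation axis scaled by sin(angle)
    c = dot(a, b)              # cos(angle)
    s2 = dot(v, v)             # sin(angle) squared
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if s2 < 1e-12:
        if c > 0.0:
            return eye         # already aligned
        # Opposite directions: rotate by pi about any axis perpendicular to a.
        helper = (1.0, 0.0, 0.0) if abs(a[0]) < 0.9 else (0.0, 1.0, 0.0)
        k = normalize(cross(a, helper))
        return [[2.0 * k[i] * k[j] - eye[i][j] for j in range(3)]
                for i in range(3)]
    # Rodrigues form: R = I + K + K^2 * (1 - c) / s^2, with K the cross-product
    # matrix of v.
    K = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    f = (1.0 - c) / s2
    return [[eye[i][j] + K[i][j] + f * K2[i][j] for j in range(3)]
            for i in range(3)]


def apply(R: Mat3, v: Vec3) -> Vec3:
    return (dot((R[0][0], R[0][1], R[0][2]), v),
            dot((R[1][0], R[1][1], R[1][2]), v),
            dot((R[2][0], R[2][1], R[2][2]), v))


# Example: rotate a knife so that its "handle" direction points straight up.
handle = SemanticOrientation("handle", (1.0, 0.0, 0.0))
R = rotation_aligning(handle.direction, (0.0, 0.0, 1.0))
rotated = apply(R, handle.direction)
```

A single direction fixes only two of the three rotational degrees of freedom, so a full 6-DoF target pose would need a second named direction or another constraint. The sketch leaves that remaining rotation about the aligned axis unspecified.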

