SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
February 18, 2025
Authors: Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
cs.AI
Abstract
Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of vision-language models (VLMs) to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
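
As a rough illustration of the idea (not the paper's implementation or API), the sketch below shows one way a language-defined semantic orientation could be represented as a unit direction vector and checked against an orientational constraint during manipulation; the class and function names here are hypothetical.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class SemanticOrientation:
    """A language-grounded direction on an object, without a canonical object frame."""
    object_name: str       # e.g. "usb"
    description: str       # e.g. "plug-in"
    direction: np.ndarray  # predicted 3D direction for this description (world/camera frame)


def orientation_satisfied(pred: SemanticOrientation,
                          target_direction: np.ndarray,
                          max_angle_deg: float = 15.0) -> bool:
    """Orientational constraint: the predicted semantic direction must lie
    within max_angle_deg of the direction implied by the instruction."""
    a = pred.direction / np.linalg.norm(pred.direction)
    b = target_direction / np.linalg.norm(target_direction)
    angle = np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))
    return angle <= max_angle_deg


# Example: require the USB's "plug-in" direction to point roughly along +x.
pred = SemanticOrientation("usb", "plug-in", np.array([0.97, 0.20, 0.10]))
print(orientation_satisfied(pred, np.array([1.0, 0.0, 0.0])))  # True (about 13 degrees off)
```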