Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds
May 20, 2025
Authors: Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska
cs.AI
Abstract
We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability of embodied cognition that is essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4×4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 Degrees of Freedom (6-DOF) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
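
As a concrete illustration of the supervision described in the abstract, the minimal sketch below shows what a single dataset instance and its Z-axis target could look like. The field names, file name, and matrix values are hypothetical and assume only the standard homogeneous-transform convention (rotation in the upper-left 3×3 block, translation in the last column); they are not the dataset's published schema.

```python
# Hypothetical sketch of one dataset instance and of reading the Z-axis
# distance from a ground-truth 4x4 homogeneous transformation matrix.
import numpy as np

# 4x4 homogeneous transform: upper-left 3x3 block is the rotation,
# the last column holds the translation (x, y, z, 1).
T = np.array([
    [1.0, 0.0, 0.0, 0.10],   # x translation (m)
    [0.0, 1.0, 0.0, -0.05],  # y translation (m)
    [0.0, 0.0, 1.0, 0.75],   # z translation (m) -- the supervised target
    [0.0, 0.0, 0.0, 1.0],
])

# Illustrative instance layout: RGB image, language description, object pose.
instance = {
    "rgb_image": "scene_0001.png",  # rendered RGB frame (hypothetical name)
    "description": "A red cube on the table in front of the robot.",
    "transform": T,                 # ground-truth object pose
}

# Supervision signal for the Z-axis distance task: translation along Z.
z_distance = instance["transform"][2, 3]
print(f"Z-axis distance: {z_distance:.2f} m")  # -> 0.75 m
```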