Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds
May 20, 2025
Authors: Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska
cs.AI
Abstract
We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability of embodied cognition that is essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4×4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 Degrees of Freedom (6-DOF) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
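
As a concrete illustration of the supervision described in the abstract, the minimal sketch below shows what a single dataset instance and its Z-axis target could look like. The field names, file name, and matrix values are hypothetical and assume only the standard homogeneous-transform convention (rotation in the upper-left 3×3 block, translation in the last column); they are not the dataset's published schema.

```python
# Hypothetical sketch of one dataset instance and of reading the Z-axis
# distance from a ground-truth 4x4 homogeneous transformation matrix.
import numpy as np

# 4x4 homogeneous transform: upper-left 3x3 block is the rotation,
# the last column holds the translation (x, y, z, 1).
T = np.array([
    [1.0, 0.0, 0.0, 0.10],   # x translation (m)
    [0.0, 1.0, 0.0, -0.05],  # y translation (m)
    [0.0, 0.0, 1.0, 0.75],   # z translation (m) -- the supervised target
    [0.0, 0.0, 0.0, 1.0],
])

# Illustrative instance layout: RGB image, language description, object pose.
instance = {
    "rgb_image": "scene_0001.png",  # rendered RGB frame (hypothetical name)
    "description": "A red cube on the table in front of the robot.",
    "transform": T,                 # ground-truth object pose
}

# Supervision signal for the Z-axis distance task: translation along Z.
z_distance = instance["transform"][2, 3]
print(f"Z-axis distance: {z_distance:.2f} m")  # -> 0.75 m
```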