

Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds

May 20, 2025
作者: Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska
cs.AI

Abstract

We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4×4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 Degrees of Freedom (DOF) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
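To make the dataset structure concrete, the sketch below shows how one instance might be represented and how the Z-axis distance targeted in the paper can be read off a ground-truth 4×4 homogeneous transformation matrix. The field names ("image", "description", "transform") and the file path are illustrative assumptions, not the dataset's published schema.

```python
# Minimal sketch of one dataset instance, assuming illustrative field names.
import numpy as np

instance = {
    "image": "frames/scene_0001.png",            # RGB render from NVIDIA Omniverse (hypothetical path)
    "description": "A red cube on the table in front of the robot.",
    "transform": np.array([                      # ground-truth object pose as a 4x4 homogeneous matrix
        [1.0, 0.0, 0.0, 0.12],
        [0.0, 1.0, 0.0, -0.05],
        [0.0, 0.0, 1.0, 0.80],                   # translation along Z (depth from the camera/robot frame)
        [0.0, 0.0, 0.0, 1.0],
    ]),
}

# The paper's initial task is inferring Z-axis distance; with a homogeneous
# transform, that is the Z component of the translation column.
z_distance = instance["transform"][2, 3]
print(f"Ground-truth Z distance: {z_distance:.2f} m")
```

A supervised setup along these lines would pair the image and description as model input with the Z translation (and, in later extensions, the full 6-DOF pose) as the regression target.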
