Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds
May 20, 2025
Authors: Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska
cs.AI
Abstract
We present a conceptual framework for training Vision-Language Models (VLMs)
to perform Visual Perspective Taking (VPT), a core capability for embodied
cognition essential for Human-Robot Interaction (HRI). As a first step toward
this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse,
that enables supervised learning for spatial reasoning tasks. Each instance
includes an RGB image, a natural language description, and a ground-truth 4×4
transformation matrix representing object pose. We focus on inferring Z-axis
distance as a foundational skill, with future extensions targeting full
six-degree-of-freedom (6-DoF) reasoning. The dataset is publicly available to
support further research. This work serves as a foundational step toward
embodied AI systems capable of spatial understanding in interactive human-robot
scenarios.
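
To make the described instance format concrete, the sketch below shows what a single record could look like in Python. The field names, file path, caption text, and numeric values are illustrative assumptions, not the released dataset's actual schema; the only structure taken from the abstract is the triple of RGB image, natural language description, and 4×4 ground-truth pose, with the Z-axis distance read off the translation column of that matrix.

```python
import numpy as np

# Hypothetical dataset instance (field names and values are illustrative;
# the actual schema of the released dataset may differ).
instance = {
    "image_path": "scene_0001.png",  # RGB render, e.g. from NVIDIA Omniverse
    "description": "A red cube on the table, 0.42 m in front of the camera.",
    # Ground-truth 4x4 homogeneous transform of the object pose:
    # rotation R in the upper-left 3x3 block, translation t in the last column.
    "pose": np.array([
        [1.0, 0.0, 0.0,  0.10],
        [0.0, 1.0, 0.0, -0.05],
        [0.0, 0.0, 1.0,  0.42],
        [0.0, 0.0, 0.0,  1.00],
    ]),
}

# The Z-axis distance used as the initial supervision target is the third
# component of the translation vector.
z_distance = instance["pose"][2, 3]
print(f"Z-axis distance: {z_distance:.2f} m")
```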