Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds
May 20, 2025
Authors: Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska
cs.AI
Abstract
We present a conceptual framework for training Vision-Language Models (VLMs)
to perform Visual Perspective Taking (VPT), a core capability for embodied
cognition essential for Human-Robot Interaction (HRI). As a first step toward
this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse,
that enables supervised learning for spatial reasoning tasks. Each instance
includes an RGB image, a natural language description, and a ground-truth 4×4
transformation matrix representing object pose. We focus on inferring Z-axis
distance as a foundational skill, with future extensions targeting full
six-degree-of-freedom (6-DoF) reasoning. The dataset is publicly available to
support further research. This work serves as a foundational step toward
embodied AI systems capable of spatial understanding in interactive human-robot
scenarios.
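
To make the described instance format concrete, the sketch below shows what a single record could look like in Python. The field names, file path, caption text, and numeric values are illustrative assumptions, not the released dataset's actual schema; the only structure taken from the abstract is the triple of RGB image, natural language description, and 4×4 ground-truth pose, with the Z-axis distance read off the translation column of that matrix.

```python
import numpy as np

# Hypothetical dataset instance (field names and values are illustrative;
# the actual schema of the released dataset may differ).
instance = {
    "image_path": "scene_0001.png",  # RGB render, e.g. from NVIDIA Omniverse
    "description": "A red cube on the table, 0.42 m in front of the camera.",
    # Ground-truth 4x4 homogeneous transform of the object pose:
    # rotation R in the upper-left 3x3 block, translation t in the last column.
    "pose": np.array([
        [1.0, 0.0, 0.0,  0.10],
        [0.0, 1.0, 0.0, -0.05],
        [0.0, 0.0, 1.0,  0.42],
        [0.0, 0.0, 0.0,  1.00],
    ]),
}

# The Z-axis distance used as the initial supervision target is the third
# component of the translation vector.
z_distance = instance["pose"][2, 3]
print(f"Z-axis distance: {z_distance:.2f} m")
```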