ObjectReact: Learning Object-Relative Control for Visual Navigation
September 11, 2025
Authors: Sourav Garg, Dustin Craggs, Vineeth Bhat, Lachlan Mares, Stefan Podgorski, Madhava Krishna, Feras Dayoub, Ian Reid
cs.AI
Abstract
Visual navigation using only a single camera and a topological map has
recently become an appealing alternative to methods that require additional
sensors and 3D maps. This is typically achieved through an "image-relative"
approach to estimating control from a given pair of current observation and
subgoal images. However, image-level representations of the world have
limitations because images are strictly tied to the agent's pose and
embodiment. In contrast, objects, being a property of the map, offer an
embodiment- and trajectory-invariant world representation. In this work, we
present a new paradigm of learning "object-relative" control that exhibits
several desirable characteristics: a) new routes can be traversed without
strictly requiring imitation of prior experience, b) the control prediction
problem can be decoupled from solving the image matching problem, and c) high
invariance can be achieved in cross-embodiment deployment for variations across
both training-testing and mapping-execution settings. We propose a topometric
map representation in the form of a "relative" 3D scene graph, which is used to
obtain more informative object-level global path planning costs. We train a
local controller, dubbed "ObjectReact", conditioned directly on a high-level
"WayObject Costmap" representation that eliminates the need for an explicit RGB
input. We demonstrate the advantages of learning object-relative control over
its image-relative counterpart across sensor height variations and multiple
navigation tasks that challenge the underlying spatial understanding
capability, e.g., navigating a map trajectory in the reverse direction. We
further show that our sim-only policy is able to generalize well to real-world
indoor environments. Code and supplementary material are accessible via the
project page: https://object-react.github.io/
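As a purely illustrative sketch (not the paper's implementation), the object-level global planning idea can be pictured as computing cost-to-go values over a graph of mapped objects and steering toward the cheapest currently visible object. All object names, edge weights, and function names below are hypothetical:

```python
import heapq

# Hypothetical topometric map: nodes are objects, edge weights approximate
# relative 3D distances between objects observed together in the map.
edges = {
    "door": {"table": 2.0, "shelf": 4.0},
    "table": {"door": 2.0, "chair": 1.5},
    "shelf": {"door": 4.0, "chair": 3.0},
    "chair": {"table": 1.5, "shelf": 3.0},
}

def object_path_costs(graph, goal):
    """Dijkstra from the goal object: cost-to-go for every object."""
    costs = {goal: 0.0}
    pq = [(0.0, goal)]
    while pq:
        c, node = heapq.heappop(pq)
        if c > costs.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, w in graph[node].items():
            nc = c + w
            if nc < costs.get(nbr, float("inf")):
                costs[nbr] = nc
                heapq.heappush(pq, (nc, nbr))
    return costs

def steer(visible_objects, costs):
    """Head toward the visible object with the lowest cost-to-go."""
    return min(visible_objects, key=lambda o: costs.get(o, float("inf")))

costs = object_path_costs(edges, goal="chair")
print(steer(["door", "shelf"], costs))  # prints "shelf"
```

In the paper, a learned controller conditioned on a costmap over such per-object costs replaces this greedy `steer` step; the sketch only shows why object-level costs, unlike raw images, are tied to the map rather than to any one agent's pose or trajectory.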