ObjectReact: Learning Object-Relative Control for Visual Navigation
September 11, 2025
Authors: Sourav Garg, Dustin Craggs, Vineeth Bhat, Lachlan Mares, Stefan Podgorski, Madhava Krishna, Feras Dayoub, Ian Reid
cs.AI
Abstract
Visual navigation using only a single camera and a topological map has
recently become an appealing alternative to methods that require additional
sensors and 3D maps. This is typically achieved through an "image-relative"
approach to estimating control from a given pair of current observation and
subgoal images. However, image-level representations of the world have
limitations because images are strictly tied to the agent's pose and
embodiment. In contrast, objects, being a property of the map, offer an
embodiment- and trajectory-invariant world representation. In this work, we
present a new paradigm of learning "object-relative" control that exhibits
several desirable characteristics: a) new routes can be traversed without
strictly requiring imitation of prior experience, b) the control prediction
problem can be decoupled from solving the image matching problem, and c) high
invariance can be achieved in cross-embodiment deployment for variations across
both training-testing and mapping-execution settings. We propose a topometric
map representation in the form of a "relative" 3D scene graph, which is used to
obtain more informative object-level global path planning costs. We train a
local controller, dubbed "ObjectReact", conditioned directly on a high-level
"WayObject Costmap" representation that eliminates the need for an explicit RGB
input. We demonstrate the advantages of learning object-relative control over
its image-relative counterpart across sensor height variations and multiple
navigation tasks that challenge the underlying spatial understanding
capability, e.g., navigating a map trajectory in the reverse direction. We
further show that our sim-only policy is able to generalize well to real-world
indoor environments. Code and supplementary material are accessible via the
project page: https://object-react.github.io/
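The abstract describes computing object-level global path planning costs over a topological map of objects. As a rough illustration only (the paper's actual scene-graph representation and cost formulation are not specified here), the sketch below models a map as a graph of object nodes with traversal costs and runs Dijkstra's algorithm from the goal object, yielding a cost-to-goal per object; the graph structure, object names, and edge costs are all hypothetical.

```python
import heapq

# Hypothetical illustration (not the paper's implementation): a topological
# map as a graph of object nodes with edge traversal costs. Dijkstra from the
# goal gives each object a cost-to-goal, analogous to an object-level costmap.
def object_costs_to_goal(graph, goal):
    """Return the minimal traversal cost from every reachable object to `goal`.

    graph: dict mapping node -> list of (neighbor, edge_cost). Edges are
    assumed symmetric, so reversing a mapped trajectory reuses the same costs.
    """
    costs = {goal: 0.0}
    frontier = [(0.0, goal)]
    while frontier:
        cost, node = heapq.heappop(frontier)
        if cost > costs.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < costs.get(neighbor, float("inf")):
                costs[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor))
    return costs

# Toy map: a direct chair->door edge (cost 5.0) versus a cheaper detour
# through the table (1.0 + 2.0), which the planner correctly prefers.
toy_graph = {
    "chair": [("table", 1.0), ("door", 5.0)],
    "table": [("chair", 1.0), ("door", 2.0)],
    "door":  [("table", 2.0), ("chair", 5.0)],
}
print(object_costs_to_goal(toy_graph, "door"))
# -> {'door': 0.0, 'table': 2.0, 'chair': 3.0}
```

Because the costs attach to objects rather than to the agent's recorded viewpoints, the same map supports routes that were never traversed during mapping, which is consistent with the trajectory-invariance the abstract claims.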