Object-Centric Representations Improve Policy Generalization in Robot Manipulation
May 16, 2025
Authors: Alexandre Chapin, Bruno Machado, Emmanuel Dellandrea, Liming Chen
cs.AI
Abstract
Visual representations are central to the learning and generalization
capabilities of robotic manipulation policies. While existing methods rely on
global or dense features, such representations often entangle task-relevant and
irrelevant scene information, limiting robustness under distribution shifts. In
this work, we investigate object-centric representations (OCR) as a structured
alternative that segments visual input into a finite set of entities,
introducing inductive biases that align more naturally with manipulation tasks.
We benchmark a range of visual encoders (object-centric, global, and dense
methods) across a suite of simulated and real-world manipulation tasks ranging
from simple to complex, and evaluate their generalization under diverse visual
conditions including changes in lighting, texture, and the presence of
distractors. Our findings reveal that OCR-based policies outperform dense and
global representations in generalization settings, even without task-specific
pretraining. These insights suggest that OCR is a promising direction for
designing visual systems that generalize effectively in dynamic, real-world
robotic environments.
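The abstract describes OCR as segmenting visual input into a finite set of entities, in contrast to a single global feature vector. The sketch below is not taken from the paper; the module names, dimensions, and the attention-based slot readout are illustrative assumptions, shown only to contrast the interface a policy sees in each case: one pooled vector per image versus K per-entity slot vectors.

```python
# Illustrative sketch (hypothetical, not the paper's architecture):
# a global encoder vs. a slot-based object-centric encoder for a policy input.
import torch
import torch.nn as nn

class GlobalEncoder(nn.Module):
    """Collapses the whole image into a single feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, img):                                  # img: (B, 3, H, W)
        return self.proj(self.backbone(img).flatten(1))      # (B, feat_dim)

class SlotEncoder(nn.Module):
    """Object-centric encoder: the scene is summarized as K slot vectors,
    intended to correspond to individual entities. Slots here are read out by
    cross-attention from learned queries, a simplified stand-in for slot-based
    OCR models."""
    def __init__(self, num_slots=6, slot_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, slot_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(slot_dim, slot_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.slots = nn.Parameter(torch.randn(num_slots, slot_dim))
        self.attn = nn.MultiheadAttention(slot_dim, num_heads=1, batch_first=True)

    def forward(self, img):                                  # img: (B, 3, H, W)
        feats = self.backbone(img)                           # (B, D, H', W')
        tokens = feats.flatten(2).transpose(1, 2)            # (B, H'*W', D)
        queries = self.slots.unsqueeze(0).expand(img.size(0), -1, -1)
        slots, _ = self.attn(queries, tokens, tokens)        # (B, K, D)
        return slots

if __name__ == "__main__":
    img = torch.randn(2, 3, 64, 64)
    print(GlobalEncoder()(img).shape)   # torch.Size([2, 256]) -- one vector per image
    print(SlotEncoder()(img).shape)     # torch.Size([2, 6, 64]) -- K entity slots
```

A downstream policy head can consume the slot set by pooling or attending over it, which is the structural inductive bias the abstract argues aligns naturally with manipulation tasks.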