Object-Centric Representations Improve Policy Generalization in Robot Manipulation

Abstract

Visual representations are central to the learning and generalization capabilities of robotic manipulation policies. Existing methods rely on global or dense features, but such representations often entangle task-relevant and task-irrelevant scene information, which limits robustness under distribution shifts. In this work, we investigate object-centric representations (OCR) as a structured alternative that segments visual input into a finite set of entities, introducing inductive biases that align more naturally with manipulation tasks. We benchmark a range of visual encoders (object-centric, global, and dense methods) across a suite of simulated and real-world manipulation tasks ranging from simple to complex, and we evaluate their generalization under diverse visual conditions, including changes in lighting and texture and the presence of distractors. Our findings show that OCR-based policies outperform dense and global representations in generalization settings, even without task-specific pretraining. These insights suggest that OCR is a promising direction for designing visual systems that generalize effectively in dynamic, real-world robotic environments.

