Are We Done with Object-Centric Learning?

Abstract

Object-centric learning (OCL) seeks to learn representations that encode only a single object, isolated from other objects and background cues in the scene. This approach underpins several aims, including out-of-distribution (OOD) generalization, sample-efficient composition, and modeling of structured environments. Most research has focused on developing unsupervised mechanisms that separate objects into discrete slots in the representation space, evaluated using unsupervised object discovery. However, with recent sample-efficient segmentation models, we can separate objects in pixel space and encode them independently. This approach achieves remarkable zero-shot performance on OOD object discovery benchmarks, scales to foundation models, and handles a variable number of slots out of the box. Hence, the goal of OCL methods, obtaining object-centric representations, has been largely achieved. Despite this progress, a key question remains: how does the ability to separate objects within a scene contribute to broader OCL objectives, such as OOD generalization? We address this by investigating the OOD generalization challenge caused by spurious background cues through the lens of OCL. We propose a novel, training-free probe called Object-Centric Classification with Applied Masks (OCCAM), demonstrating that segmentation-based encoding of individual objects significantly outperforms slot-based OCL methods. However, challenges in real-world applications remain. We provide a toolbox for the OCL community to use scalable object-centric representations, and we focus on practical applications and fundamental questions, such as understanding object perception in human cognition. Our code is available at https://github.com/AlexanderRubinstein/OCCAM.
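
To make the pixel-space idea in the abstract concrete, the following is a minimal sketch, not the authors' OCCAM implementation (which lives in the linked repository). It assumes that binary object masks are already available from some segmentation model, and it uses a toy mean-color function as a stand-in for a pretrained image encoder. All function names here (`apply_mask`, `crop_to_mask`, `toy_encoder`, `encode_objects`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def apply_mask(image, mask, fill=0.0):
    """Keep pixels where `mask` is True and replace all others with `fill`."""
    return np.where(mask[..., None], image, fill)


def crop_to_mask(image, mask):
    """Crop `image` to the tight bounding box of a non-empty boolean mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]


def toy_encoder(image):
    """Hypothetical stand-in for a pretrained encoder: mean color per channel."""
    return image.reshape(-1, image.shape[-1]).mean(axis=0)


def encode_objects(image, masks, encoder=toy_encoder):
    """Encode each masked object independently; returns (num_objects, dim)."""
    features = []
    for mask in masks:
        masked = apply_mask(image, mask)          # remove background pixels
        features.append(encoder(crop_to_mask(masked, mask)))
    return np.stack(features)


# Toy scene: a green 2x2 object on a red background (values in [0, 1]).
scene = np.zeros((4, 4, 3))
scene[..., 0] = 1.0
scene[1:3, 1:3] = [0.0, 1.0, 0.0]

object_mask = np.zeros((4, 4), dtype=bool)
object_mask[1:3, 1:3] = True

object_features = encode_objects(scene, [object_mask])
```

Because the background is masked out before encoding, the object's feature vector in this toy setup does not change when the background color changes, which is the property that matters when background cues are spurious. The number of masks can vary per image, so no fixed slot count is needed.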
