Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning

Abstract

Video object-centric learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation. The reconstruction objective implicitly encourages the model to occupy every slot, so a single object ends up represented by multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, expanding capacity only where it is needed and preventing fragmentation from the outset. During slot expansion, however, meaningful sub-parts can emerge only if coarse-level semantics are already well separated, and with a small initial slot budget and an MSE objective, semantic boundaries remain blurry. We therefore augment MSE with a structure-aware loss that preserves local contrast and edge information, encouraging each slot to sharpen its semantic boundaries. Lastly, we propose cyclic inference, which rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. Taken together, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, with structural cues and cyclic inference providing further gains. FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri. Our code is available at github.com/wjun0830/SlotCurri.
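
To make the idea of allocating new slots where reconstruction error remains high more concrete, the following is a minimal sketch of one expansion step. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how error is attributed to slots or how a new slot is initialised. Here we assume a mask-weighted mean of per-pixel squared error as the per-slot score, and a noisy copy of the worst slot as the new slot. The names per_slot_error and expand_slots and the noise parameter are hypothetical.

```python
import random


def per_slot_error(masks, error_map):
    """Mask-weighted mean reconstruction error for each slot (assumed scoring rule).

    masks: list of K lists of length N, soft assignment of pixel n to slot k.
    error_map: list of length N with per-pixel squared reconstruction error.
    """
    scores = []
    for mask in masks:
        mass = sum(mask)
        weighted = sum(m * e for m, e in zip(mask, error_map))
        # Slots that own (almost) no pixels get score 0 so they are not expanded.
        scores.append(weighted / mass if mass > 1e-8 else 0.0)
    return scores


def expand_slots(slots, masks, error_map, noise=0.01, rng=None):
    """Append one slot initialised near the slot whose region reconstructs worst.

    Returns the enlarged slot list and the index of the slot that was split.
    The noisy-copy initialisation is an assumption made for this sketch.
    """
    rng = rng or random.Random(0)
    scores = per_slot_error(masks, error_map)
    worst = max(range(len(slots)), key=lambda k: scores[k])
    child = [v + rng.gauss(0.0, noise) for v in slots[worst]]
    return slots + [child], worst


# Toy example: two slots over four pixels; the second slot's region is poorly reconstructed.
slots = [[0.0, 0.0], [1.0, 1.0]]
masks = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
error_map = [0.1, 0.1, 0.9, 0.7]
new_slots, split_index = expand_slots(slots, masks, error_map)
```

In this toy case the second slot carries the larger mask-weighted error, so it is the one that receives an additional slot, while the well-reconstructed region keeps a single slot. A curriculum would repeat such a step at chosen points during training until the slot budget is reached.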
