Selective Synergistic Learning for Video Object-Centric Learning

Abstract

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks built on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. Because these two maps exhibit different properties, a recent dense alignment strategy attempted to reconcile the discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, which severely limits scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: it uses the encoder strictly for boundary refinement and the decoder strictly for interior denoising. This is realized via a pseudo-labeling scheme with linear complexity, which eliminates the need for quadratic spatial comparisons. In addition, to avoid reinforcing architectural biases such as slot redundancy, we introduce transitive pseudo-label merging, which consolidates overlapping slots based on spatio-temporal activation consistency. Extensive experiments demonstrate that SSync improves decomposition quality, serves as a versatile plug-and-play module, and exhibits exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.
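
To make the idea of transitive merging concrete, the following is a minimal illustrative sketch and not the released SSync implementation. It assumes that each slot has a soft activation per spatio-temporal patch (flattened over time and space), that "activation consistency" can be approximated by the intersection-over-union (IoU) of thresholded activations, and that slots are grouped with a union-find structure so that merges propagate transitively. The function name `merge_overlapping_slots` and both thresholds are hypothetical choices made for this example.

```python
from itertools import combinations


def find(parent, i):
    """Return the root of slot i, compressing the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def merge_overlapping_slots(activations, act_threshold=0.5, iou_threshold=0.5):
    """Map each slot index to a merged group id (hypothetical helper).

    activations[k][n] is the assumed soft activation of slot k at the
    flattened spatio-temporal patch n. Two slots are linked when the IoU of
    their thresholded supports reaches iou_threshold; links are closed
    transitively, so A~B and B~C puts A, B, and C in the same group.
    """
    num_slots = len(activations)
    # Binary support of each slot over all frames and patches.
    supports = [
        {n for n, a in enumerate(row) if a > act_threshold}
        for row in activations
    ]
    parent = list(range(num_slots))
    for i, j in combinations(range(num_slots), 2):
        union = supports[i] | supports[j]
        if not union:
            continue  # both slots are empty; nothing to compare
        iou = len(supports[i] & supports[j]) / len(union)
        if iou >= iou_threshold:
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:
                # Keep the smaller index as the representative of the group.
                parent[max(ri, rj)] = min(ri, rj)
    return [find(parent, k) for k in range(num_slots)]


# Toy example: 4 slots over 6 spatio-temporal patches.
example = [
    [0.9, 0.9, 0.9, 0.1, 0.1, 0.1],  # slot 0
    [0.1, 0.9, 0.9, 0.9, 0.1, 0.1],  # slot 1 overlaps slot 0
    [0.1, 0.1, 0.9, 0.9, 0.9, 0.1],  # slot 2 overlaps slot 1, barely slot 0
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.9],  # slot 3 is isolated
]
print(merge_overlapping_slots(example))  # [0, 0, 0, 3]
```

In this toy case slots 0 and 2 have an IoU of only 0.2, yet they end up in the same group because each overlaps slot 1, which is the transitive behaviour the abstract refers to. The pairwise loop here runs over slots rather than patches, so its cost depends on the (typically small) number of slots; how the actual method scores consistency is not specified in the abstract.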