Selective Synergistic Learning for Video Object-Centric Learning
June 14, 2026
Authors: WonJun Moon, Jae-Pil Heo
cs.AI
Abstract
Typical video object-centric learning (VOCL) approaches employ slot-based frameworks built on reconstruction-driven encoder-decoder architectures, where learning is mediated by two kinds of spatial maps: attention maps from the encoder and object maps from the decoder. Because these two maps exhibit different properties, a recent dense alignment strategy attempted to reconcile the discrepancy by using contrastive learning to enforce agreement across all spatio-temporal patches. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities between all patch pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, which severely limits scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: it uses the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized through pseudo-labeling with linear complexity, which eliminates the need for quadratic spatial comparisons. In addition, to avoid reinforcing architectural biases such as slot redundancy, we introduce a transitive pseudo-label merging mechanism that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module, while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.
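To make the idea of transitive merging concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the overlap criterion, so this sketch assumes that each slot is represented by a thresholded boolean spatio-temporal mask, that "activation consistency" is approximated by intersection-over-union (IoU), and that groups are formed with a union-find structure. The function names `merge_slots_transitively` and `demo_masks`, and the parameter `iou_threshold`, are hypothetical. Note that the pairwise comparison is over slots, not patches, so the cost stays linear in the number of spatio-temporal patches.

```python
import numpy as np


def merge_slots_transitively(masks, iou_threshold=0.5):
    """Group slots whose spatio-temporal masks overlap, transitively.

    masks: bool array of shape (K, T, H, W), one mask per slot (assumed input).
    Returns an int array of length K with a consecutive group id per slot.
    """
    k = masks.shape[0]
    flat = masks.reshape(k, -1).astype(np.float64)

    # Pairwise IoU between slots: O(K^2 * N) for N patches, i.e. linear in N.
    inter = flat @ flat.T
    area = flat.sum(axis=1)
    union = area[:, None] + area[None, :] - inter
    iou = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

    # Union-find over slots: if A overlaps B and B overlaps C, all three merge.
    parent = list(range(k))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(k):
        for j in range(i + 1, k):
            if iou[i, j] >= iou_threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    roots = [find(i) for i in range(k)]
    # Relabel root indices to consecutive group ids 0..G-1.
    _, group = np.unique(roots, return_inverse=True)
    return group


def demo_masks():
    """Toy input: 4 slots over T=2 frames of 4x6 patches (hypothetical data)."""
    masks = np.zeros((4, 2, 4, 6), dtype=bool)
    masks[0, :, :, 0:2] = True  # slot 0: columns 0-1
    masks[1, :, :, 0:3] = True  # slot 1: columns 0-2, overlaps slot 0
    masks[2, :, :, 1:4] = True  # slot 2: columns 1-3, overlaps slot 1
    masks[3, :, :, 4:6] = True  # slot 3: columns 4-5, disjoint from the rest
    return masks


group = merge_slots_transitively(demo_masks(), iou_threshold=0.4)
print(group)
```

In this toy case slots 0 and 2 have an IoU of only 0.25, below the threshold, yet they end up in the same group because each overlaps slot 1 sufficiently; slot 3 stays separate. A per-patch pseudo-label map indexed by slot could then be remapped with `group[labels]`.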