Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
March 24, 2026
Authors: WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
cs.AI
Abstract
Video Object-Centric Learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation. The model is implicitly encouraged to occupy all slots to minimize the reconstruction objective, so a single object ends up represented by multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, expanding capacity only where it is needed and preventing fragmentation from the outset. During slot expansion, however, meaningful sub-parts can emerge only if coarse-level semantics are already well separated, and with a small initial slot budget and an MSE objective, semantic boundaries remain blurry. We therefore augment MSE with a structure-aware loss that preserves local contrast and edge information, encouraging each slot to sharpen its semantic boundaries. Lastly, we propose cyclic inference, which rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. Taken together, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference. Notable FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri. Our code is available at github.com/wjun0830/SlotCurri.
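To make two of the ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "where reconstruction error remains high" can be approximated by summing per-pixel error under each slot's mask, and that cyclic inference amounts to one forward pass followed by one backward pass with the same update function. The names `pick_slot_to_expand`, `cyclic_rollout`, and `step` are hypothetical and do not come from the SlotCurri code base.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # slot state (hypothetical: any type the update function accepts)
F = TypeVar("F")  # a single video frame or its features


def pick_slot_to_expand(error_map: Sequence[float],
                        masks: Sequence[Sequence[float]]) -> int:
    """Return the index of the slot whose region carries the most error.

    error_map: per-pixel reconstruction error (flattened).
    masks:     one soft mask per slot, each the same length as error_map.
    Assumption: the slot with the largest masked error is the one that
    would benefit most from an additional slot.
    """
    masses = [sum(m * e for m, e in zip(mask, error_map)) for mask in masks]
    return max(range(len(masses)), key=masses.__getitem__)


def cyclic_rollout(frames: Sequence[F],
                   init_slots: S,
                   step: Callable[[S, F], S]) -> List[S]:
    """Roll slots forward over the clip, then backward, keeping the
    backward-pass slots for every frame.

    Assumption: the backward pass starts from the final forward state,
    so the last frame is visited twice.
    """
    slots = init_slots
    for frame in frames:                      # forward pass
        slots = step(slots, frame)
    per_frame: List[S] = [slots] * len(frames)
    for t in range(len(frames) - 1, -1, -1):  # backward pass
        slots = step(slots, frames[t])
        per_frame[t] = slots
    return per_frame


# Toy usage with scalar "slots" and a moving-average update.
ema = lambda s, f: 0.5 * s + 0.5 * f
print(cyclic_rollout([0.0, 4.0, 8.0], 0.0, ema))   # [2.625, 5.25, 6.5]
print(pick_slot_to_expand([0.1, 0.9, 0.8, 0.0],
                          [[1, 0, 0, 1], [0, 1, 1, 0]]))  # 1
```

In the toy run, a forward-only pass would leave the first frame with a slot value of 0.0, since it has seen nothing but that frame. After the backward pass, the first frame's slot (2.625) reflects the whole clip, which is the intuition behind obtaining consistent representations even in the earliest frames.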