Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning

March 24, 2026
Authors: WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
cs.AI

Abstract

Video object-centric learning seeks to decompose raw videos into a small set of object slots, but existing slot-attention models often suffer from severe over-fragmentation: the model is implicitly encouraged to occupy all slots in order to minimize the reconstruction objective, so a single object ends up represented by multiple redundant slots. We tackle this limitation with a reconstruction-guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, expanding capacity only where it is needed and preventing fragmentation from the outset. During slot expansion, however, meaningful sub-parts can emerge only if coarse-level semantics are already well separated, and with a small initial slot budget and a mean squared error (MSE) objective, semantic boundaries remain blurry. We therefore augment MSE with a structure-aware loss that preserves local contrast and edge information, encouraging each slot to sharpen its semantic boundaries. Lastly, we propose cyclic inference, which rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames. Taken together, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference. FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri. Our code is available at github.com/wjun0830/SlotCurri.
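
To make two of the ideas above concrete, the following is a minimal sketch in plain Python, not the authors' implementation (see the linked repository for that). It assumes that "where reconstruction error remains high" can be approximated by a mask-weighted mean error per slot, and that cyclic inference keeps the slot states from the backward pass; the abstract specifies neither detail. The names `slot_errors`, `slot_to_expand`, `cyclic_rollout`, and the `step` callback are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # slot state (hypothetical; e.g. a set of slot vectors)
F = TypeVar("F")  # one video frame


def slot_errors(error: Sequence[float],
                masks: Sequence[Sequence[float]]) -> List[float]:
    """Mask-weighted mean reconstruction error of each slot.

    error[p]    : per-pixel reconstruction error, flattened over the frame
    masks[k][p] : soft assignment of pixel p to slot k
    """
    out = []
    for mask in masks:
        mass = sum(mask)
        weighted = sum(m * e for m, e in zip(mask, error))
        # A slot that owns no pixels contributes no error.
        out.append(weighted / mass if mass > 0 else 0.0)
    return out


def slot_to_expand(error: Sequence[float],
                   masks: Sequence[Sequence[float]]) -> int:
    """Index of the slot whose region is reconstructed worst.

    Assumption: a new slot would be allocated for this region.
    """
    errs = slot_errors(error, masks)
    return max(range(len(errs)), key=errs.__getitem__)


def cyclic_rollout(frames: Sequence[F], init: S,
                   step: Callable[[S, F], S]) -> List[S]:
    """Roll slots forward through the frames, then backward.

    Returns one slot state per frame, in frame order. States come from
    the backward pass (assumption), so even the first frame is described
    by slots that have already seen the whole clip.
    """
    if not frames:
        return []
    slots = init
    for frame in frames:                  # forward pass: frames 0 .. T-1
        slots = step(slots, frame)
    states = [slots]                      # state at the last frame
    for frame in reversed(frames[:-1]):   # backward pass: frames T-2 .. 0
        slots = step(slots, frame)
        states.append(slots)
    states.reverse()                      # restore frame order 0 .. T-1
    return states


# Toy usage: two slots over four pixels; the second slot's region has high error.
worst = slot_to_expand([0.1, 0.1, 0.9, 0.9], [[1, 1, 0, 0], [0, 0, 1, 1]])

# Toy usage: the "slot state" is simply the list of frames visited so far.
history = cyclic_rollout(["a", "b", "c"], [], lambda s, f: s + [f])
```

In the toy rollout, the state returned for the first frame has visited every frame of the clip, which is the property the abstract attributes to cyclic inference for the earliest frames.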