Selective Synergistic Learning for Video Object-Centric Learning

Abstract

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks built on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. Because these two maps exhibit different properties, a recent dense alignment strategy attempted to reconcile the discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost that is quadratic in the total number of spatio-temporal patches, which severely limits scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: it uses the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling scheme with linear complexity, which eliminates the need for quadratic spatial comparisons. In addition, to avoid reinforcing architectural biases such as slot redundancy, we introduce a transitive pseudo-label merging step that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module, while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.
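
To make the idea of transitive merging concrete, the following is a minimal sketch and not the authors' implementation. It assumes each slot is summarized by a binarized activation mask over all spatio-temporal patches, that "overlap" is measured by intersection-over-union (IoU), and that a fixed threshold decides whether two slots are merged; the function names, the IoU criterion, and the threshold value are all hypothetical choices made for illustration. Transitivity is handled with a union-find structure, so slots A and C end up in one group if A overlaps B and B overlaps C, even when A and C do not pass the threshold directly. The cost is linear in the number of patches for a fixed number of slots.

```python
def merge_slots_transitively(masks, iou_threshold=0.5):
    """Group slots whose binarized activation masks overlap (hypothetical criterion).

    masks: list of K lists, each holding 0/1 values over all N spatio-temporal patches.
    Returns a list of length K mapping every slot to its group representative.
    """
    num_slots = len(masks)
    parent = list(range(num_slots))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Keep the smaller index as representative for deterministic output.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    for i in range(num_slots):
        for j in range(i + 1, num_slots):
            intersection = sum(a and b for a, b in zip(masks[i], masks[j]))
            union_size = sum(a or b for a, b in zip(masks[i], masks[j]))
            if union_size > 0 and intersection / union_size >= iou_threshold:
                union(i, j)

    return [find(i) for i in range(num_slots)]


def relabel(pseudo_labels, slot_to_group):
    """Replace each per-patch slot index with the index of its merged group."""
    return [slot_to_group[slot] for slot in pseudo_labels]


if __name__ == "__main__":
    # Four slots over eight patches; slots 0-1 and 1-2 overlap, slot 3 is isolated.
    example_masks = [
        [1, 1, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 1, 0, 0, 0],
        [0, 0, 1, 1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 1],
    ]
    groups = merge_slots_transitively(example_masks, iou_threshold=0.5)
    print(groups)                               # [0, 0, 0, 3]
    print(relabel([0, 1, 2, 3, 3, 2], groups))  # [0, 0, 0, 3, 3, 0]
```

In this toy example, slots 0 and 2 have an IoU of only 1/3, yet both are merged into one group because each overlaps slot 1 with an IoU of 0.6. A criterion based on soft activations or temporal consistency across frames could replace the IoU test without changing the union-find logic.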