Selective Synergistic Learning for Video Object-Centric Learning

Abstract

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. Because these two maps exhibit different properties, a recent dense alignment strategy attempted to reconcile the discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: it uses the encoder strictly for boundary refinement and the decoder strictly for interior denoising. This is realized through a pseudo-labeling scheme with linear complexity, which eliminates the need for quadratic spatial comparisons. In addition, to avoid reinforcing architectural biases such as slot redundancy, we introduce a transitive pseudo-label merging step that consolidates overlapping slots based on the consistency of their spatio-temporal activations. Extensive experiments demonstrate that SSync improves decomposition quality, serves as a versatile plug-and-play module, and exhibits exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.
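To make the idea of transitive merging concrete, the following is a minimal sketch and not the released SSync implementation. It assumes each slot is summarized by a binary activation mask over all spatio-temporal patches, and that two slots are considered overlapping when the intersection-over-union (IoU) of their masks reaches a threshold. The function name `merge_slots_transitively`, the IoU criterion, and the default threshold of 0.5 are all hypothetical choices made for illustration; the actual consistency measure may differ.

```python
from typing import List, Sequence


def merge_slots_transitively(
    masks: Sequence[Sequence[int]], iou_threshold: float = 0.5
) -> List[int]:
    """Group slots whose activation masks overlap, transitively.

    masks[k] is a 0/1 list marking the spatio-temporal patches where
    slot k is active (an assumed representation, for illustration only).
    Returns, for every slot, the smallest slot index in its merged group.
    """
    num_slots = len(masks)
    parent = list(range(num_slots))  # union-find forest over slot indices

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def iou(a: Sequence[int], b: Sequence[int]) -> float:
        inter = sum(1 for x, y in zip(a, b) if x and y)
        union = sum(1 for x, y in zip(a, b) if x or y)
        return inter / union if union else 0.0

    # Compare slot pairs only: the cost is quadratic in the (small) number
    # of slots and linear in the number of patches.
    for i in range(num_slots):
        for j in range(i + 1, num_slots):
            if iou(masks[i], masks[j]) >= iou_threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as the group representative.
                    parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(num_slots)]
```

In this sketch, slots A and C end up in the same group whenever A overlaps B and B overlaps C, even if A and C barely overlap directly, which is what "transitive" means here. The pairwise loop runs over slots rather than patches, so the cost stays linear in the number of spatio-temporal patches.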