OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

Abstract

To extend the reinforcement learning post-training paradigm to omni-modal models and jointly strengthen video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built on a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, the paradigm orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: (1) Joint Modality Integration, (2) Sample-level Modality Selection, and (3) Clip-level Modality Masking. Because the efficacy of such proxy tasks depends fundamentally on puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline that lets OmniJigsaw adapt efficiently to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in Joint Modality Integration and shows that fine-grained Clip-level Modality Masking mitigates this issue while outperforming Sample-level Modality Selection. Extensive evaluations on 15 benchmarks show substantial gains in video understanding, audio understanding, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
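The reordering proxy task and its three strategies can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `Clip`, `make_puzzle`, and `reorder_reward` are hypothetical, clips are represented by placeholder strings, the modality to keep is chosen uniformly at random (the abstract does not say how it is chosen), and the position-wise accuracy reward is an assumed stand-in for whatever reward the paper actually uses.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Clip:
    index: int            # original chronological position (hidden from the model)
    video: Optional[str]  # placeholder for visual content; None when masked
    audio: Optional[str]  # placeholder for audio content; None when masked


def _keep_only(clip: Clip, modality: str) -> None:
    """Mask the modality that is not kept (hypothetical helper)."""
    if modality == "video":
        clip.audio = None
    else:
        clip.video = None


def make_puzzle(num_clips: int, strategy: str,
                rng: random.Random) -> Tuple[List[Clip], List[int]]:
    """Build a shuffled-clip puzzle and its ground-truth answer.

    strategy: "joint"  -> every clip keeps both modalities
              "sample" -> one modality is kept for the whole sample
              "clip"   -> each clip keeps one independently chosen modality
    Random modality choice is an assumption made for this sketch.
    """
    clips = [Clip(i, f"video_{i}", f"audio_{i}") for i in range(num_clips)]

    if strategy == "joint":
        pass
    elif strategy == "sample":
        keep = rng.choice(["video", "audio"])
        for clip in clips:
            _keep_only(clip, keep)
    elif strategy == "clip":
        for clip in clips:
            _keep_only(clip, rng.choice(["video", "audio"]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    shuffled = clips[:]
    rng.shuffle(shuffled)
    # The answer lists shuffled positions in chronological order.
    target = sorted(range(num_clips), key=lambda j: shuffled[j].index)
    return shuffled, target


def reorder_reward(predicted: List[int], target: List[int]) -> float:
    """Assumed reward: fraction of positions that match the target order."""
    if sorted(predicted) != sorted(target):
        return 0.0  # not a valid permutation of the clip positions
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)
```

In the "clip" setting of this sketch, each clip exposes only one modality, so a correct ordering cannot be recovered from a single modality stream alone. That is the intuition the abstract gives for why clip-level masking counters the bi-modal shortcut.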
