

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

April 9, 2026
作者: Yiduo Jia, Muzhi Zhu, Hao Zhong, Mingyu Liu, Yuling Xi, Hao Chen, Bin Qin, Yongjie Yang, Zhenbo Luo, Chunhua Shen
cs.AI

Abstract

To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
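To make the proxy task concrete, the sketch below shows one plausible way to build a training sample of the kind the abstract describes: paired audio-visual clips are shuffled in time, clip-level modality masking keeps exactly one modality per shuffled clip, and the permutation that restores chronological order serves as the supervision target. This is a minimal illustration under stated assumptions, not the paper's actual data pipeline; the function name `make_jigsaw_sample` and its interface are hypothetical.

```python
import random

def make_jigsaw_sample(video_clips, audio_clips, seed=None):
    """Toy OmniJigsaw-style sample: shuffle paired audio-visual clips and
    record the permutation needed to recover chronological order.

    Clip-level modality masking keeps exactly one modality per shuffled
    clip, so neither stream alone suffices to reorder the sequence and
    the model is pushed toward cross-modal integration.
    (Hypothetical sketch; not the paper's implementation.)
    """
    assert len(video_clips) == len(audio_clips), "clips must be paired"
    rng = random.Random(seed)

    order = list(range(len(video_clips)))
    rng.shuffle(order)  # presentation order of the shuffled puzzle

    masked = []
    for pos, idx in enumerate(order):
        keep = rng.choice(("video", "audio"))  # clip-level modality mask
        masked.append({
            "position": pos,
            "video": video_clips[idx] if keep == "video" else None,
            "audio": audio_clips[idx] if keep == "audio" else None,
        })

    # Supervision target: for each presented position, the true
    # chronological index of that clip (the permutation to predict).
    return masked, order
```

A reinforcement-learning post-training loop could then reward the model in proportion to how well its predicted permutation matches the returned target order.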