

MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding

October 27, 2025
Authors: Xin Jin, Siyuan Li, Siyong Jian, Kai Yu, Huan Wang
cs.AI

Abstract

Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL introduces a reward signal for training but suffers from computational overhead and instability. These limitations highlight a trade-off among scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies attention-aware image mixing via token merging, which better preserves cluster representations and spatial context, and then builds preference pairs of mixed and raw images that are optimized with a SimPO loss, yielding a preference-driven training paradigm for MLLMs. As a mixup augmentation, MergeMix improves attention consistency and training efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
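
For context, SimPO (Simple Preference Optimization) is a reference-free preference objective that compares length-normalized log-probabilities of a preferred and a dispreferred response. A minimal PyTorch sketch of that loss, as it might be applied to the mixed-vs-raw-image preference pairs the abstract describes, is given below; the function name, tensor arguments, and default hyper-parameters (beta, gamma) are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps: torch.Tensor,
               policy_rejected_logps: torch.Tensor,
               chosen_lengths: torch.Tensor,
               rejected_lengths: torch.Tensor,
               beta: float = 2.0,
               gamma: float = 0.5) -> torch.Tensor:
    """Reference-free SimPO objective (Meng et al., 2024).

    *_logps are summed token log-probabilities of each response under the
    policy model; *_lengths are the response lengths in tokens. beta scales
    the length-normalized rewards and gamma is the target reward margin.
    Argument names and defaults here are illustrative assumptions, not the
    paper's implementation.
    """
    # Length-normalized implicit rewards: average log-prob per token.
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths
    # Logistic (Bradley-Terry style) loss with margin gamma.
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()
```

In the setting sketched by the abstract, the chosen and rejected responses would presumably be model outputs conditioned on the raw versus the mixed image in each preference pair; the exact pairing and hyper-parameters are defined by the paper, not by this sketch.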