MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
October 27, 2025
Authors: Xin Jin, Siyuan Li, Siyong Jian, Kai Yu, Huan Wang
cs.AI
Abstract
Vision-language alignment in multi-modal large language models (MLLMs)
typically relies on supervised fine-tuning (SFT) or reinforcement learning
(RL). SFT is stable and efficient but requires large-scale human annotations
and struggles to capture subtle preferences, while RL introduces a reward
signal but suffers from computational overhead and instability. These limitations
highlight a trade-off between scalability, robustness, and alignment quality.
To address this, we propose MergeMix, a training-time augmentation paradigm
that bridges SFT and RL. MergeMix first applies attention-aware image mixing
via token merging, which preserves richer cluster representations and spatial
context; it then builds a preference-driven training paradigm for MLLMs by
constructing preference pairs from mixed and raw images and optimizing them
with the SimPO loss. As a
mixup augmentation, MergeMix enhances attention consistency and efficiency,
surpassing other heuristic-based methods in classification. Extensive
experiments demonstrate that MergeMix achieves competitive accuracy with
improved efficiency, providing a scalable approach to preference alignment in
classification and MLLMs.
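The two components named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released implementation: `attention_mixup` uses fixed image patches as a stand-in for merged-token clusters, the patch size `p` is an arbitrary choice, and which side of the raw/mixed pair counts as "chosen" in `simpo_loss` is our assumption. The loss itself follows the published reference-free SimPO formulation (length-normalized log-likelihood margin with target margin gamma).

```python
import numpy as np

def attention_mixup(img_a, img_b, attn_b, lam=0.5, p=4):
    """Toy attention-aware mixing: paste the most-attended p x p
    patches of img_b onto img_a until roughly a fraction `lam` of
    the area comes from img_b. Patches stand in for the paper's
    merged-token clusters (a simplification, not the exact method)."""
    H, W, _ = img_a.shape
    gh, gw = H // p, W // p
    # mean attention score per patch of img_b
    scores = attn_b.reshape(gh, p, gw, p).mean(axis=(1, 3)).ravel()
    k = int(round(lam * gh * gw))          # patches taken from img_b
    top = np.argsort(scores)[::-1][:k]     # most salient patches first
    mixed = img_a.copy()
    for idx in top:
        r, c = divmod(int(idx), gw)
        mixed[r*p:(r+1)*p, c*p:(c+1)*p] = img_b[r*p:(r+1)*p, c*p:(c+1)*p]
    lam_eff = k / (gh * gw)                # label-mix weight = area share
    return mixed, lam_eff

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO objective: reference-free, length-normalized log-likelihood
    margin passed through -log(sigmoid(.)). Pairing the response on the
    raw image as 'chosen' and on the mixed image as 'rejected' is an
    assumption drawn from the abstract."""
    margin = (beta * logp_chosen / len_chosen
              - beta * logp_rejected / len_rejected - gamma)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

In practice the log-probabilities would come from the MLLM's responses to the two images; the loss drops toward zero as the length-normalized margin between chosen and rejected responses grows past gamma.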