MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding

Abstract

Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotation and cannot capture subtle preferences; RL introduces a reward signal into training but suffers from computational overhead and instability. These limitations highlight a trade-off among scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. MergeMix first applies attention-aware image mixing via token merging, which provides richer cluster representations and spatial context. It then introduces a preference-driven training paradigm for MLLMs that builds preference pairs from mixed and raw images and optimizes with the SimPO loss. As a mixup augmentation, MergeMix improves attention consistency and efficiency and surpasses other heuristic-based methods on classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
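To make the preference-optimization step concrete, the following is a minimal sketch of the standard SimPO objective for a single preference pair, written in plain Python. It is not the authors' implementation. The function names, the default values of `beta` and `gamma`, and the choice of the raw-image response as "chosen" and the mixed-image response as "rejected" are assumptions made for illustration; the abstract states only that pairs are built from mixed and raw images.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=0.5):
    """SimPO loss for one preference pair.

    chosen_logps / rejected_logps: per-token log-probabilities of the
    response under the policy. Assumption for illustration: "chosen" is
    the response conditioned on the raw image, "rejected" the response
    conditioned on the mixed image.
    beta, gamma: reward scale and target margin (illustrative defaults).
    """
    # Length-normalized log-likelihood serves as the implicit reward.
    reward_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    reward_rejected = beta * sum(rejected_logps) / len(rejected_logps)
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

Because each reward is the average rather than the summed log-probability, the loss needs no reference model and does not favor longer responses. The loss shrinks as the chosen response becomes more likely than the rejected one by more than the margin `gamma`.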
