Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

Abstract

Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy synchronization. We ask whether these two bottlenecks can be addressed jointly by training parts of the mixture independently and reconciling them once in parameter space. We develop a local quadratic theory inside a shared flat basin that yields three results: weight merging produces a curvature-weighted variance reduction; PCA-aligned conflict splitting maximizes this gain along high-curvature directions; and merging additionally acts as spectral filtering with implicit norm regularization. These results directly motivate MERIT, a decentralized, merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top PCA conflict axes, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 (joint training) to 57.0, a gain of 2.7 points. The same recipe scales to a 7B model on a 1.6M-example, 176-source mixture, matching or exceeding centralized joint training with minimal cost overhead, and transfers to text-only FLAN. Our code is available at https://github.com/naver-ai/merit.
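To make the split-and-merge steps concrete, the following is a minimal pure-Python sketch of the two operations named above. It is not the released implementation. The function names `split_by_conflict_axis` and `token_weighted_merge` are hypothetical, checkpoints are assumed to be dictionaries of flat parameter lists, and the split uses only a single PCA axis (giving two partitions), whereas the abstract refers to the top axes in general. The abstract does not specify how dataset-level gradients are estimated, so the sketch takes them as given.

```python
from math import sqrt


def split_by_conflict_axis(grads, iters=50):
    """Split datasets into two groups by the sign of their top-PCA score.

    grads: one gradient vector (list of floats) per dataset. Assumption:
    these stand in for the dataset-level gradient estimates.
    """
    # Unit-normalise so only gradient direction, not scale, drives the split.
    unit = []
    for g in grads:
        norm = sqrt(sum(x * x for x in g)) or 1.0
        unit.append([x / norm for x in g])
    n, d = len(unit), len(unit[0])
    mean = [sum(u[j] for u in unit) / n for j in range(d)]
    centred = [[u[j] - mean[j] for j in range(d)] for u in unit]
    # Gram matrix of centred gradients; its top eigenvector is proportional
    # to the datasets' scores on the first principal component.
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in centred]
            for ci in centred]
    scores = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):  # power iteration
        nxt = [sum(gram[i][k] * scores[k] for k in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variation between gradients: nothing to split
            break
        scores = [x / norm for x in nxt]
    # Eigenvector sign is arbitrary; fix it so dataset 0 scores non-negative.
    if scores[0] < 0:
        scores = [-s for s in scores]
    positive = [i for i, s in enumerate(scores) if s >= 0]
    negative = [i for i, s in enumerate(scores) if s < 0]
    return positive, negative


def token_weighted_merge(checkpoints, token_counts):
    """Average parameter dicts, weighting each by its partition's token count."""
    total = sum(token_counts)
    if total <= 0:
        raise ValueError("token_counts must sum to a positive number")
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(n * ckpt[name][i] for ckpt, n in zip(checkpoints, token_counts))
            / total
            for i in range(size)
        ]
    return merged
```

In this sketch, datasets whose gradients point in opposing directions land in different partitions. Each partition would then be fine-tuned separately, and the resulting checkpoints are combined once, with a partition trained on three times as many tokens receiving three times the weight.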