Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

Abstract

Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy synchronization. We ask whether these two bottlenecks can be addressed jointly by training parts of the mixture independently and reconciling them once in parameter space. We develop a local quadratic theory inside a shared flat basin that yields three results: weight merging produces a curvature-weighted variance reduction; PCA-aligned conflict splitting maximizes this gain along high-curvature directions; and merging additionally acts as spectral filtering with implicit norm regularization. These results directly motivate MERIT, a decentralized merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top PCA conflict axes, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 (joint training) to 57.0. The same recipe scales to a 7B model on a 1.6M-example, 176-source mixture, where it matches or exceeds centralized joint training with minimal cost overhead, and it transfers to text-only FLAN. Our code is available at https://github.com/naver-ai/merit.
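To make the split-then-merge idea concrete, the following is a minimal NumPy sketch, not the authors' implementation (the repository linked above is the reference). It assumes each dataset is summarized by a single flattened gradient vector, uses only the first principal axis for a two-way split, and represents checkpoints as plain dictionaries of arrays. The function names `conflict_matrix`, `split_along_top_axis`, and `token_weighted_merge` are hypothetical and chosen for illustration only.

```python
import numpy as np


def _unit_rows(grads):
    """Scale each per-dataset gradient vector to unit length."""
    grads = np.asarray(grads, dtype=float)
    return grads / np.linalg.norm(grads, axis=1, keepdims=True)


def conflict_matrix(grads):
    """Pairwise cosine similarity between dataset gradients.

    Negative entries indicate conflicting update directions.
    """
    g = _unit_rows(grads)
    return g @ g.T


def split_along_top_axis(grads):
    """Two-way split by the sign of the projection on the top PCA axis.

    Assumption: PCA is run on the centered, normalized gradients. The
    sign of a principal axis is arbitrary, so only the grouping matters,
    not which side is labeled True.
    """
    g = _unit_rows(grads)
    centered = g - g.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]
    return scores >= 0


def token_weighted_merge(state_dicts, token_counts):
    """Average parameters with weights proportional to tokens seen."""
    weights = np.asarray(token_counts, dtype=float)
    weights = weights / weights.sum()
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }


if __name__ == "__main__":
    # Four toy "datasets": two pull along +x, two along -x.
    toy_grads = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, 0.1]]
    print(conflict_matrix(toy_grads).round(2))
    print(split_along_top_axis(toy_grads))

    # Two independently fine-tuned toy checkpoints, merged once.
    ckpt_a = {"w": np.array([1.0, 2.0])}
    ckpt_b = {"w": np.array([3.0, 6.0])}
    print(token_weighted_merge([ckpt_a, ckpt_b], token_counts=[100, 300]))
```

In this toy setting the two mutually conflicting groups end up in different partitions, and the merged checkpoint sits closer to the partition that saw more tokens. A real pipeline would estimate gradients on the actual model, could use several PCA axes to form more than two partitions, and would fine-tune each partition before merging.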