**Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance**
November 17, 2025
Authors: Shalini Maiti, Amar Budhiraja, Bhavul Gauri, Gaurav Chaurasia, Anton Protopopov, Alexis Audran-Reiss, Michael Slater, Despoina Magka, Tatiana Shavrina, Roberta Raileanu, Yoram Bachrach
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping, the practice of averaging the weights of multiple models that share the same architecture, has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach to model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. In contrast to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math, and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.
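The weighted-averaging step at the heart of souping can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes each model checkpoint is a dict mapping parameter names to arrays, and the `soup` helper and the toy `experts` list are hypothetical names for this example.

```python
import numpy as np

def soup(models, weights):
    """Non-uniform weighted average of same-architecture model parameters.

    models: list of state dicts (parameter name -> np.ndarray),
            all sharing the same keys and shapes.
    weights: one coefficient per model; normalized to sum to 1 so the
             souped model stays in the convex hull of the inputs.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return {
        name: sum(wi * m[name] for wi, m in zip(w, models))
        for name in models[0]
    }

# Toy example: three "category expert" checkpoints with one 2-parameter layer.
experts = [
    {"layer.weight": np.array([1.0, 0.0])},
    {"layer.weight": np.array([0.0, 1.0])},
    {"layer.weight": np.array([1.0, 1.0])},
]

# Non-uniform weights favoring the first expert (normalized to 0.5/0.25/0.25).
souped = soup(experts, weights=[2.0, 1.0, 1.0])
print(souped["layer.weight"])  # -> [0.75 0.5]
```

In SoCE the weights would be chosen per category-expert cluster to maximize aggregate benchmark performance, rather than set by hand as in this toy example.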