Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning
May 14, 2026
Authors: Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim
cs.AI
Abstract
We present Darwin Family, a framework for training-free evolutionary merging of large language models via gradient-free weight-space recombination. We ask whether frontier-level reasoning performance can be improved without additional training, by reorganizing latent capabilities already encoded in existing checkpoints. Darwin introduces three key ideas: (i) a 14-dimensional adaptive merge genome enabling fine-grained component- and block-level recombination; (ii) MRI-Trust Fusion, which adaptively balances diagnostic layer-importance signals with evolutionary search through a learnable trust parameter; and (iii) an Architecture Mapper that enables cross-architecture breeding between heterogeneous model families. Empirically, the flagship Darwin-27B-Opus achieves 86.9% on GPQA Diamond, ranking #6 among 1,252 evaluated models, and outperforming its fully trained foundation model without any gradient-based training. Across scales from 4B to 35B parameters, Darwin models consistently improve over their parents, support recursive multi-generation evolution, and enable a training-free evolutionary merge that combines Transformer- and Mamba-based components. Together, the Darwin Family demonstrates that diagnostic-guided evolutionary merging is a practical and reproducible alternative to costly post-training pipelines for reasoning-centric language models.
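The abstract gives no formulas, so the following is only a toy sketch of what trust-weighted evolutionary merging could look like, not the authors' implementation. It assumes that a scalar trust value in [0, 1] linearly blends a diagnostic per-layer prior with a searched per-layer ratio, that parents are merged by per-layer linear interpolation, and that the trust value is evolved together with the ratios by a simple mutate-and-select loop. The names `blend_ratio`, `merge_layers`, and `evolve`, the genome layout (one ratio per layer plus one trust value, rather than the 14 dimensions described above), and the fitness function are all hypothetical.

```python
import random

def blend_ratio(prior, searched, trust):
    """Assumed fusion rule: trust=1 follows the diagnostic prior, trust=0 follows the search."""
    return trust * prior + (1.0 - trust) * searched

def merge_layers(parent_a, parent_b, ratios):
    """Per-layer linear interpolation: ratio r takes r of parent A and (1 - r) of parent B."""
    return [[r * a + (1.0 - r) * b for a, b in zip(layer_a, layer_b)]
            for layer_a, layer_b, r in zip(parent_a, parent_b, ratios)]

def evolve(parent_a, parent_b, prior, fitness, generations=30, pop=8, seed=0):
    """Gradient-free search over a hypothetical genome: per-layer ratios plus one trust value."""
    rng = random.Random(seed)
    n = len(prior)

    def decode(genome):
        # The last gene is the trust value; the rest are searched per-layer ratios.
        trust = genome[-1]
        return [blend_ratio(p, s, trust) for p, s in zip(prior, genome[:-1])]

    def score(genome):
        return fitness(merge_layers(parent_a, parent_b, decode(genome)))

    # Start from the best of a random initial population.
    best = max(([rng.random() for _ in range(n + 1)] for _ in range(pop)), key=score)
    for _ in range(generations):
        # Gaussian mutation, clamped to [0, 1]; the current best is kept (elitism).
        children = [[min(1.0, max(0.0, g + rng.gauss(0.0, 0.1))) for g in best]
                    for _ in range(pop)]
        best = max(children + [best], key=score)
    return decode(best), score(best)

if __name__ == "__main__":
    # Two tiny "models" with two layers of two weights each (illustrative data only).
    a = [[1.0, 1.0], [1.0, 1.0]]
    b = [[0.0, 0.0], [0.0, 0.0]]
    target = [[0.3, 0.3], [0.7, 0.7]]
    # Stand-in fitness: negative squared distance to a target; a real run would use a benchmark score.
    fit = lambda m: -sum((x - t) ** 2 for lm, lt in zip(m, target) for x, t in zip(lm, lt))
    print(evolve(a, b, prior=[0.5, 0.5], fitness=fit))
```

In this sketch no gradients are computed anywhere: candidate merges are only scored and selected, which is the sense in which such a search is training-free. A real system would replace the toy fitness with held-out benchmark evaluation and the nested lists with model checkpoint tensors.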