Darwin Family: Training-Free Scaling of Language Model Reasoning via MRI-Trust-Weighted Evolutionary Merging

Abstract

We present Darwin Family, a framework for training-free evolutionary merging of large language models through gradient-free recombination in weight space. We ask whether frontier-level reasoning performance can be improved without additional training by reorganizing latent capabilities already encoded in existing checkpoints. Darwin introduces three key ideas: (i) a 14-dimensional adaptive merge genome that enables fine-grained recombination at the component and block level; (ii) MRI-Trust Fusion, which adaptively balances diagnostic layer-importance signals against evolutionary search through a learnable trust parameter; and (iii) an Architecture Mapper that enables cross-architecture breeding between heterogeneous model families. Empirically, the flagship Darwin-27B-Opus achieves 86.9% on GPQA Diamond, ranking #6 among 1,252 evaluated models and outperforming its fully trained foundation model without any gradient-based training. Across scales from 4B to 35B parameters, Darwin models consistently improve over their parents, support recursive multi-generation evolution, and permit training-free evolutionary merges that combine Transformer- and Mamba-based components. Taken together, Darwin Family demonstrates that diagnostic-guided evolutionary merging is a practical and reproducible alternative to costly post-training pipelines for reasoning-centric language models.
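To make the interplay between a diagnostic prior, a trust parameter, and gradient-free search concrete, the following is a minimal toy sketch, not the paper's implementation. Everything in it is an assumption made for illustration: the function names (`fuse_ratio`, `merge_layers`, `evolve`), the convex-blend form of the trust fusion, the use of plain linear interpolation as the merge operator, the (1+1) evolution strategy, and the genome layout (one ratio per layer plus one trust gene, rather than the 14 dimensions described above). Models are stood in for by lists of float lists.

```python
import random


def fuse_ratio(genome_ratio, mri_ratio, trust):
    """Blend an evolved merge ratio with a diagnostic (MRI-style) ratio.

    Assumed form: trust = 0 ignores the diagnostic signal,
    trust = 1 follows it completely.
    """
    return (1.0 - trust) * genome_ratio + trust * mri_ratio


def merge_layers(parent_a, parent_b, ratios):
    """Per-layer linear interpolation; ratios[i] is the share taken from parent_b."""
    return [
        [(1.0 - r) * wa + r * wb for wa, wb in zip(layer_a, layer_b)]
        for layer_a, layer_b, r in zip(parent_a, parent_b, ratios)
    ]


def evolve(parent_a, parent_b, mri_ratios, fitness,
           generations=50, sigma=0.1, seed=0):
    """Hypothetical (1+1) evolution strategy over [per-layer ratios..., trust].

    No gradients are used: each step mutates the genome with Gaussian noise,
    merges the parents, and keeps the child only if fitness does not drop.
    """
    rng = random.Random(seed)
    genome = [0.5] * len(parent_a) + [0.5]  # last gene is the trust parameter

    def decode(g):
        trust = g[-1]
        return [fuse_ratio(r, m, trust) for r, m in zip(g[:-1], mri_ratios)]

    best = fitness(merge_layers(parent_a, parent_b, decode(genome)))
    for _ in range(generations):
        # Mutate every gene and clamp it back into [0, 1].
        child = [min(1.0, max(0.0, x + rng.gauss(0.0, sigma))) for x in genome]
        score = fitness(merge_layers(parent_a, parent_b, decode(child)))
        if score >= best:
            genome, best = child, score
    return genome, best
```

In this sketch the trust gene is searched alongside the merge ratios, which is one possible reading of "learnable"; in practice `fitness` would be a benchmark score of the merged checkpoint rather than a closed-form function.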