
Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

March 12, 2026
作者: Yulu Gan, Phillip Isola
cs.AI

Abstract

Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples N parameter perturbations at random, selects the top K, and ensembles predictions via majority vote. Despite its simplicity, this approach is competitive with standard post-training methods such as PPO, GRPO, and ES for contemporary large-scale models.
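The sample/select/vote procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function and parameter names (`evaluate`, `predict`, `sigma`) are placeholders, the perturbations are isotropic Gaussian for simplicity, and the prediction task is assumed to be classification with integer labels.

```python
import numpy as np

def sample_select_vote(base_weights, evaluate, predict, X,
                       N=64, K=8, sigma=0.01, seed=0):
    """Illustrative sketch of fully parallel post-training:
    sample N random perturbations of the pretrained weights,
    keep the K with the best task score, and ensemble their
    predictions on inputs X by majority vote.
    `evaluate` and `predict` are user-supplied placeholders."""
    rng = np.random.default_rng(seed)
    # Sample N candidate "experts" in the neighborhood of the pretrained weights.
    candidates = [base_weights + sigma * rng.standard_normal(base_weights.shape)
                  for _ in range(N)]
    # Score each candidate on the task (higher is better) and keep the top K.
    scores = np.array([evaluate(w) for w in candidates])
    top_k = [candidates[i] for i in np.argsort(scores)[-K:]]
    # Ensemble: majority vote over the K experts' integer class predictions.
    preds = np.stack([predict(w, X) for w in top_k])  # shape (K, n_examples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```

Note that, unlike PPO, GRPO, or ES, every candidate is evaluated independently, so the N evaluations can run in parallel with no gradient computation or iterative update.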