Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate
July 8, 2025
Author: A. Bochkov
cs.AI
Abstract
The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth.
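To make the "docking port" idea concrete, the following is a minimal sketch of a frozen, deterministic embedding table built from rendered Unicode glyphs. It is illustrative only: the raster size, font, and normalization below are assumptions, and the exact recipe of [1] is not specified in this abstract.

```python
# Hedged sketch: deterministic, non-trainable input embeddings derived from the
# visual form of Unicode glyphs. The 16x16 raster, the PIL default font, and
# unit-norm scaling are assumptions, not the recipe of [1].
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def glyph_embedding(ch: str, size: int = 16) -> np.ndarray:
    """Render one character to a size x size grayscale bitmap and flatten it.
    The vector depends only on the glyph, so it never changes during training
    and is identical across independently trained models."""
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((0, 0), ch, fill=255, font=ImageFont.load_default())
    vec = np.asarray(img, dtype=np.float32).flatten()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Every model maps token ids onto the same frozen vectors, giving separately
# trained specialists a common representational interface to compose against.
table = np.stack([glyph_embedding(chr(cp)) for cp in range(32, 127)])  # printable-ASCII demo
```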
First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD.
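Both scaling moves can be sketched compactly. The code below is an illustrative approximation rather than the released implementation: `merge_logits`, `GrowingTransformer`, and `grow_layer` are hypothetical names, the encoder-style layer stands in for whatever block the actual models use, and it assumes all specialists share the same frozen embedding table and vocabulary so their logits are directly comparable.

```python
# Hedged sketch of (1) post-hoc logit-averaging composition and (2) layer-wise
# constructive growth on top of a frozen embedding "docking port". PyTorch is an
# assumption; hyperparameters are placeholders.
import torch
import torch.nn as nn

def merge_logits(experts, token_ids):
    """MoE-style composition after training: average the output logits of
    specialist models that share the same vocabulary. No architectural change,
    no further training."""
    with torch.no_grad():
        logits = [expert(token_ids) for expert in experts]  # each: (batch, seq, vocab)
    return torch.stack(logits, dim=0).mean(dim=0)

class GrowingTransformer(nn.Module):
    """Layer-wise constructive training: start shallow, then repeatedly append a
    fresh layer and train only that layer while earlier layers stay frozen."""
    def __init__(self, frozen_embeddings, d_model=256, n_heads=4):
        super().__init__()
        # Frozen, deterministic input embeddings (assumes embedding dim == d_model).
        self.embed = nn.Embedding.from_pretrained(frozen_embeddings, freeze=True)
        self.layers = nn.ModuleList()
        self.d_model, self.n_heads = d_model, n_heads
        self.head = nn.Linear(d_model, frozen_embeddings.size(0))  # head stays trainable here

    def grow_layer(self):
        # Freeze everything trained so far, then stack one new trainable layer.
        for p in self.layers.parameters():
            p.requires_grad_(False)
        self.layers.append(
            nn.TransformerEncoderLayer(self.d_model, self.n_heads, batch_first=True)
        )

    def forward(self, token_ids):
        h = self.embed(token_ids)
        for layer in self.layers:
            h = layer(h)
        return self.head(h)
```

In this reading, the shared frozen substrate is what makes both moves cheap: specialists agree on the input space by construction, so their logits can be averaged directly, and each growth step only optimizes the newly added layer.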
Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.