Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

Abstract

The prevailing paradigm for scaling large language models (LLMs) is monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development built on non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged after training into a single, more capable Mixture-of-Experts (MoE) model with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model shows immediate performance improvements on reasoning benchmarks such as MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology in which a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
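
The merging step described in the abstract (averaging the output logits of separately trained specialists) can be illustrated with a minimal sketch. This is not the authors' released code. The abstract states only that logits are averaged, so the function names, the toy four-token vocabulary, and the logit values below are hypothetical, and the sketch assumes the experts emit logits over one shared vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_logits(expert_logits):
    """Element-wise mean of per-expert logit vectors.

    Assumption: every expert scores the same vocabulary in the same
    order, so position i means the same token for all experts.
    """
    if not expert_logits:
        raise ValueError("need at least one expert")
    vocab_size = len(expert_logits[0])
    if any(len(vec) != vocab_size for vec in expert_logits):
        raise ValueError("all experts must share the same vocabulary size")
    n = len(expert_logits)
    return [sum(column) / n for column in zip(*expert_logits)]


# Hypothetical next-token logits from two specialists (toy 4-token vocabulary).
russian_expert = [2.0, 0.5, -1.0, 0.0]
chinese_expert = [0.0, 1.5, 3.0, -2.0]

merged = average_logits([russian_expert, chinese_expert])
probs = softmax(merged)  # merged next-token distribution
```

Averaging in logit space rather than probability space is taken directly from the abstract's wording. Any weighting or routing between experts would be a further assumption that the abstract does not describe.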
