Training Language Models via Neural Cellular Automata

Abstract

Pre-training is the crucial stage for large language models (LLMs), since it is where most of their representations and capabilities are acquired. However, pre-training on natural language has drawbacks: high-quality text is finite, it carries human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs, that is, training on synthetic data first and on natural language second. NCA data exhibits rich spatiotemporal structure and statistics resembling those of natural language, while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl, which uses more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives the transfer, we find that attention layers are the most transferable, and that the optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to the target domain. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
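
To make the idea of NCA-generated token data concrete, the following is a minimal sketch and not the authors' implementation. It assumes a small square grid of discrete cell states, a randomly initialized two-layer network applied to each cell's 3x3 neighborhood, a deterministic argmax update, wrap-around borders, and row-major flattening of each frame into tokens. All function names and parameters (`make_rule`, `rollout_tokens`, `num_states`, `hidden`, and so on) are hypothetical and chosen only for illustration.

```python
import math
import random


def make_rule(num_states, hidden, rng):
    """Random two-layer MLP: one-hot 3x3 neighborhood -> logits over the next state."""
    n_in = 9 * num_states
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(num_states)]
    return w1, w2


def next_state(neighborhood, rule, num_states):
    """Apply the rule to one cell's neighborhood (a list of 9 discrete states)."""
    w1, w2 = rule
    # The one-hot input is sparse, so only the active weight columns are summed.
    active = [i * num_states + s for i, s in enumerate(neighborhood)]
    h = [math.tanh(sum(row[j] for j in active)) for row in w1]
    logits = [sum(w * x for w, x in zip(row, h)) for row in w2]
    # Assumption: deterministic argmax update (a sampled update is equally plausible).
    return max(range(num_states), key=logits.__getitem__)


def step(grid, rule, num_states):
    """Synchronously update every cell, using wrap-around (toroidal) borders."""
    size = len(grid)
    new_grid = []
    for r in range(size):
        row = []
        for c in range(size):
            nb = [grid[(r + dr) % size][(c + dc) % size]
                  for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            row.append(next_state(nb, rule, num_states))
        new_grid.append(row)
    return new_grid


def rollout_tokens(size=8, num_states=4, steps=6, hidden=16, seed=0):
    """Roll out one random NCA and flatten its frames into a token sequence."""
    rng = random.Random(seed)
    rule = make_rule(num_states, hidden, rng)
    grid = [[rng.randrange(num_states) for _ in range(size)] for _ in range(size)]
    tokens = []
    for _ in range(steps):
        tokens.extend(s for row in grid for s in row)  # row-major flattening
        grid = step(grid, rule, num_states)
    return tokens


tokens = rollout_tokens()
print(len(tokens), tokens[:16])
```

Each call with a different `seed` draws a new random rule and initial grid, so a corpus could be assembled by concatenating many such rollouts. In this sketch, `num_states` and `hidden` are the only knobs that affect how complex the dynamics can be; how the paper actually controls NCA complexity is not specified in the abstract.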
