

Training Language Models via Neural Cellular Automata

March 9, 2026
Authors: Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal
cs.AI

Abstract

Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs--training on synthetic-then-natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
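The abstract does not specify the NCA architecture or tokenization scheme, but the core idea of generating a synthetic token stream from cellular-automaton dynamics can be illustrated with a minimal sketch. Everything below is a hypothetical construction, assuming a 1D grid of vector-valued cells updated by a small fixed MLP over each cell's neighborhood, with states hashed into a discrete vocabulary; the paper's actual method may differ substantially.

```python
import numpy as np

# Illustrative sketch only: a tiny 1D "neural" cellular automaton whose
# per-cell update rule is a fixed random MLP over the local neighborhood.
# Cell states are discretized into token ids to form a synthetic pre-training
# stream. All sizes and the hashing scheme are assumptions for illustration.

rng = np.random.default_rng(0)
N_CELLS, STATE_DIM, HIDDEN, VOCAB, STEPS = 64, 8, 32, 256, 32

# Fixed random MLP weights acting on the [left, self, right] neighborhood.
W1 = rng.normal(0, 0.5, (3 * STATE_DIM, HIDDEN))
W2 = rng.normal(0, 0.5, (HIDDEN, STATE_DIM))
PROJ = rng.normal(0, 1.0, (STATE_DIM,))  # hypothetical state -> token hash

def step(states):
    """One synchronous NCA update with circular boundary conditions."""
    left = np.roll(states, 1, axis=0)
    right = np.roll(states, -1, axis=0)
    x = np.concatenate([left, states, right], axis=1)
    h = np.tanh(x @ W1)
    return np.tanh(states + h @ W2)  # residual update, tanh keeps states bounded

def to_tokens(states):
    """Project each cell state to a scalar and quantize into VOCAB bins."""
    s = states @ PROJ
    return ((s - s.min()) / (np.ptp(s) + 1e-9) * (VOCAB - 1)).astype(int)

# Roll the automaton forward and serialize states into a token stream.
states = rng.normal(0, 1, (N_CELLS, STATE_DIM))
tokens = []
for _ in range(STEPS):
    states = step(states)
    tokens.extend(to_tokens(states).tolist())

print(len(tokens))  # one synthetic sequence of N_CELLS * STEPS tokens
```

Varying the MLP's weight scale or neighborhood size changes the complexity of the emitted dynamics, which is the kind of knob the abstract's domain-specific finding (simpler dynamics for code, more complex for math and web text) would presumably tune.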