
Training Language Models via Neural Cellular Automata

March 9, 2026
Authors: Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal
cs.AI

Abstract

Pre-training is crucial for large language models (LLMs), as it is the stage in which most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs, that is, training on synthetic data before natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl, despite the latter using roughly ten times more tokens and more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
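
The abstract does not specify the authors' exact NCA architecture or how rollouts are tokenized, so the following is only a minimal illustrative sketch of the general recipe: roll a small neural update rule forward over a 1D grid of cells, then discretize the resulting trajectory into synthetic token ids suitable for language-model pre-pre-training. Every name and hyperparameter here (GRID, STEPS, VOCAB, HIDDEN, the 3-cell neighborhood, the tanh update) is an assumption for illustration, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

GRID = 64     # number of cells in the 1D automaton (assumed)
STEPS = 32    # rollout length per sample (assumed)
VOCAB = 16    # discretization levels, i.e. synthetic vocabulary size (assumed)
HIDDEN = 8    # width of the tiny neural update rule (assumed)

# Random weights for the per-cell update rule: each cell reads its own
# state plus its two neighbors (a 3-cell neighborhood on a circular grid).
W1 = rng.normal(0.0, 1.0, size=(3, HIDDEN))
W2 = rng.normal(0.0, 1.0, size=(HIDDEN, 1))

def nca_step(state: np.ndarray) -> np.ndarray:
    """One synchronous NCA update over a 1D circular grid."""
    neigh = np.stack([np.roll(state, 1), state, np.roll(state, -1)], axis=-1)
    h = np.tanh(neigh @ W1)              # per-cell hidden activation
    return np.tanh(h @ W2).squeeze(-1)   # next state, kept in [-1, 1]

def rollout_tokens(steps: int = STEPS) -> np.ndarray:
    """Roll the NCA forward and discretize the trajectory into token ids."""
    state = rng.uniform(-1.0, 1.0, size=GRID)
    frames = []
    for _ in range(steps):
        state = nca_step(state)
        frames.append(state)
    traj = np.concatenate(frames)  # flatten (time x space) into one sequence
    # Bin continuous cell states into VOCAB discrete ids in [0, VOCAB).
    return np.digitize(traj, np.linspace(-1.0, 1.0, VOCAB - 1))

tokens = rollout_tokens()
print(tokens[:20])  # a synthetic, non-linguistic token stream
```

In the paper's "synthetic-then-natural" setup, streams like this would first be used for ordinary next-token pre-training before switching to natural language; the abstract's complexity finding suggests the update rule's richness could then be tuned per target domain (simpler dynamics for code, more complex ones for math and web text).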