Training Language Models via Neural Cellular Automata

Abstract

Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has limitations: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs, that is, training on synthetic data first and natural language second. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl, which uses more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
