Training Language Models via Neural Cellular Automata

Abstract

Pre-training is crucial for large language models (LLMs), since it is when most representations and capabilities are acquired. However, pre-training on natural language has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs, that is, training first on synthetic and then on natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling those of natural language, while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl, which requires more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives the transfer, we find that attention layers are the most transferable, and that the optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
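
The abstract does not say how NCA rollouts are turned into token sequences. As a rough illustration only, the sketch below assumes a 2D grid of discrete cell states, a randomly initialized two-layer network as the local update rule, periodic boundaries, and row-major flattening of each frame into tokens. The names `make_rule`, `step`, and `rollout_tokens` and all hyperparameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def make_rule(num_states, hidden, rng):
    """Random two-layer MLP over a 3x3 neighborhood of one-hot states (assumed rule)."""
    w1 = rng.normal(size=(9 * num_states, hidden))
    w2 = rng.normal(size=(hidden, num_states))
    return w1, w2


def step(grid, rule, num_states):
    """Apply the same local rule to every cell; returns the next integer grid."""
    w1, w2 = rule
    onehot = np.eye(num_states)[grid]  # (H, W, S)
    # Collect the 3x3 neighborhood of every cell, wrapping at the edges.
    neighbors = [
        np.roll(onehot, shift=(dy, dx), axis=(0, 1))
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
    ]
    x = np.concatenate(neighbors, axis=-1)  # (H, W, 9 * S)
    logits = np.tanh(x @ w1) @ w2           # (H, W, S)
    return logits.argmax(axis=-1)           # deterministic next state per cell


def rollout_tokens(num_states=8, size=8, steps=16, hidden=16, seed=0):
    """Roll the automaton forward and flatten all frames into one token stream."""
    rng = np.random.default_rng(seed)
    rule = make_rule(num_states, hidden, rng)
    grid = rng.integers(0, num_states, size=(size, size))
    frames = [grid]
    for _ in range(steps):
        grid = step(grid, rule, num_states)
        frames.append(grid)
    # Each cell state is treated as one token; frames are concatenated in time order.
    return np.stack(frames).reshape(-1)


tokens = rollout_tokens()
```

Under these assumptions, each cell state acts as one token from a vocabulary of size `num_states`. Sampling a different seed yields a different rule, which is one conceivable way to vary the complexity of the generated dynamics.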
