HRM-Text: Efficient Pretraining Beyond Scale

Abstract

The current pretraining paradigm for large language models (LLMs) relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems learn with high sample efficiency through multi-timescale processing, as in the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces the standard Transformer with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic layers and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. As an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens with a $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite using roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with open models of 2-7B parameters. These results demonstrate that co-designing architectures and training objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.
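To make the PrefixLM masking mentioned above concrete, the following is a minimal sketch of the generic technique, not the HRM-Text implementation. It assumes that the instruction acts as the prefix (attended bidirectionally) and the response is attended causally, and it further assumes that the task-completion objective scores only response tokens; the abstract does not state either detail. The names `prefix_lm_mask` and `response_loss_mask` are hypothetical.

```python
def prefix_lm_mask(prefix_len, total_len):
    """Build a PrefixLM attention mask as a list of boolean rows.

    mask[i][j] is True when position i may attend to position j.
    Positions inside the prefix see the whole prefix (bidirectional);
    positions after the prefix see the prefix plus earlier positions (causal).
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie in [0, total_len]")
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


def response_loss_mask(prefix_len, total_len):
    """Mark which positions contribute to the loss.

    Assumption (not stated in the abstract): only response tokens,
    i.e. positions at or after prefix_len, are scored.
    """
    return [i >= prefix_len for i in range(total_len)]


if __name__ == "__main__":
    # Two instruction tokens followed by two response tokens.
    for row in prefix_lm_mask(prefix_len=2, total_len=4):
        print(["x" if allowed else "." for allowed in row])
    print(response_loss_mask(prefix_len=2, total_len=4))
```

With a prefix of length 2 in a sequence of length 4, the first two rows allow attention to both instruction tokens, while the last two rows add strictly causal access to the response. Setting `prefix_len=0` recovers an ordinary causal mask.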