HRM-Text: Efficient Pretraining Beyond Scaling
May 20, 2026
Authors: Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori
cs.AI
Abstract
The current pretraining paradigm for large language models (LLMs) relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems learn with high sample efficiency through multi-timescale processing, such as the functional organization of the frontoparietal loop. Inspired by this, we introduce HRM-Text, which replaces the standard Transformer with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. As an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens with a $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite using roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with open models of 2-7B parameters. These results demonstrate that co-designing architectures and training objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.
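The abstract names PrefixLM masking but does not spell it out. The following is a minimal sketch of the generic PrefixLM attention pattern, not the authors' implementation: instruction (prefix) tokens attend to each other bidirectionally, while response tokens attend to the full prefix and causally to earlier response tokens. The function names `prefix_lm_mask` and `response_loss_mask` are hypothetical, and restricting the loss to response tokens is an assumed reading of the "task-completion objective", which the abstract does not define.

```python
from typing import List


def prefix_lm_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Generic PrefixLM pattern (illustrative, not taken from the paper):
    every position sees the whole prefix, and otherwise only positions
    at or before itself.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie between 0 and total_len")
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


def response_loss_mask(prefix_len: int, total_len: int) -> List[bool]:
    """Return True for positions whose tokens count toward the loss.

    Assumption: only response tokens (positions >= prefix_len) are scored.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie between 0 and total_len")
    return [i >= prefix_len for i in range(total_len)]


if __name__ == "__main__":
    # Two instruction tokens followed by two response tokens.
    for row in prefix_lm_mask(prefix_len=2, total_len=4):
        print("".join("x" if allowed else "." for allowed in row))
    print(response_loss_mask(prefix_len=2, total_len=4))
```

With `prefix_len=0` the mask reduces to the ordinary causal (lower-triangular) mask, which is one way to see PrefixLM as a relaxation of standard autoregressive masking on the instruction portion only.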