HRM-Text: Efficient Pretraining Beyond Scaling
May 20, 2026
Authors: Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori
cs.AI
Abstract
The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces the standard Transformer with a Hierarchical Recurrent Model (HRM) that decouples computation into a slow-evolving strategic layer and a fast-evolving execution layer. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and a $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite using roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with open models of 2-7B parameters. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.
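The abstract names PrefixLM masking without spelling it out. As a rough illustration of the general technique only (not the authors' implementation), the sketch below builds a PrefixLM attention mask in which instruction (prefix) tokens attend to each other bidirectionally and response tokens attend causally. The function names `prefix_lm_mask` and `response_loss_mask` are hypothetical, and restricting the loss to response tokens is an assumption about how a task-completion objective might be paired with the mask, not a detail stated in the abstract.

```python
def prefix_lm_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """Return mask[i][j], True when query position i may attend to key position j.

    Generic PrefixLM rule (hypothetical helper, not from the paper):
      - every position may see the whole prefix (j < prefix_len);
      - beyond the prefix, attention is causal (j <= i).
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie between 0 and total_len")
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


def response_loss_mask(prefix_len: int, total_len: int) -> list[bool]:
    """Mark which token positions count as training targets.

    Assumption for illustration: only response tokens contribute to the loss.
    """
    return [i >= prefix_len for i in range(total_len)]


if __name__ == "__main__":
    # Two instruction tokens followed by two response tokens.
    for row in prefix_lm_mask(prefix_len=2, total_len=4):
        print("".join("x" if allowed else "." for allowed in row))
    print(response_loss_mask(prefix_len=2, total_len=4))
```

With `prefix_len=0` the mask reduces to an ordinary causal mask, and with `prefix_len=total_len` it becomes fully bidirectional, which is one way to see PrefixLM as an interpolation between the two.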