HRM-Text: Efficient Pretraining Beyond Scaling

Abstract

The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems learn with high sample efficiency through multi-timescale processing, as seen in the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces the standard Transformer with a Hierarchical Recurrent Model (HRM) that decouples computation into a slowly evolving strategic layer and a rapidly evolving execution layer. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. As an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch with only 40 billion unique tokens and a $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite using roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.
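To make the PrefixLM masking mentioned in the abstract concrete, the following is a minimal sketch of the generic technique, not the authors' implementation. It assumes the instruction is the prefix and the response is the continuation, and that the task-completion objective scores only response tokens; the abstract does not spell out either detail. The function names `prefix_lm_mask` and `response_loss_mask` are hypothetical.

```python
def prefix_lm_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """Return mask[i][j], True when position i may attend to position j.

    Assumption: the first `prefix_len` tokens are the instruction (prefix)
    and the remaining tokens are the response.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            # Prefix tokens are visible to every position, so the prefix
            # attends to itself bidirectionally. Everything else is visible
            # only causally (j <= i), so the prefix never sees the response.
            row.append(j < prefix_len or j <= i)
        mask.append(row)
    return mask


def response_loss_mask(prefix_len: int, total_len: int) -> list[bool]:
    """Return loss_mask[t], True when the token at position t is a target.

    Assumption: only response tokens contribute to the training loss.
    """
    return [t >= prefix_len for t in range(total_len)]


if __name__ == "__main__":
    # Toy example: 3 instruction tokens followed by 2 response tokens.
    for row in prefix_lm_mask(prefix_len=3, total_len=5):
        print("".join("x" if allowed else "." for allowed in row))
    print(response_loss_mask(prefix_len=3, total_len=5))
```

With `prefix_len=0` the mask reduces to an ordinary causal mask, which is one way to see PrefixLM as a generalization of standard left-to-right language modeling.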