Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Abstract

Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
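The mechanism described above (one shared block reused across recursion steps, per-token recursion depths, and attention restricted to tokens that are still active) can be illustrated with a toy sketch. This is not the authors' implementation: the names `shared_block`, `route_depths`, and `mixture_of_recursions` are hypothetical, the "block" is a single parameter-free causal attention update rather than a stack of Transformer layers, and the router is a fixed sigmoid score rather than a trained one. The abstract does not specify how the routers are trained or how depths are chosen, so that part in particular is an assumption.

```python
import math


def shared_block(states, active):
    """Apply one pass of the shared block to the active tokens only.

    Toy stand-in for a Transformer layer stack (assumption): each active
    token attends causally to active tokens at or before its position and
    adds the attended mixture to its own state (a residual update).
    Inactive tokens are neither queries nor keys at this step.
    """
    new_states = list(states)
    for i in active:
        keys = [j for j in active if j <= i]  # causal, active-only keys
        scores = [sum(a * b for a, b in zip(states[i], states[j])) for j in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # stable softmax
        total = sum(weights)
        dim = len(states[i])
        mixed = [
            sum(w / total * states[j][d] for w, j in zip(weights, keys))
            for d in range(dim)
        ]
        new_states[i] = [x + y for x, y in zip(states[i], mixed)]
    return new_states


def route_depths(states, router_weights, max_depth):
    """Hypothetical router: map each token state to a depth in 1..max_depth."""
    depths = []
    for h in states:
        score = sum(w * x for w, x in zip(router_weights, h))
        gate = 1.0 / (1.0 + math.exp(-score))  # squashed to (0, 1)
        depths.append(min(max_depth, 1 + int(gate * max_depth)))
    return depths


def mixture_of_recursions(states, depths, max_depth):
    """Reuse the same block for each recursion step, on a shrinking token set."""
    active_counts = []
    for step in range(1, max_depth + 1):
        # A token stays active while its assigned depth covers this step.
        active = [i for i, d in enumerate(depths) if d >= step]
        active_counts.append(len(active))
        states = shared_block(states, active)
    return states, active_counts


# Four toy token states of dimension 2, with hand-picked depths.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
final_states, active_counts = mixture_of_recursions(tokens, [1, 3, 2, 3], 3)
print(active_counts)  # [4, 3, 2]: fewer tokens are processed at deeper steps
```

In this sketch the per-step attention cost scales with the number of active tokens rather than the full sequence length, and only active tokens contribute keys and values at a given step, which loosely mirrors the selective key-value caching described in the abstract. The KV sharing variant (reusing key-value pairs from the first recursion) is not modeled here.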
