Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
July 14, 2025
Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
cs.AI
Abstract
Scaling language models unlocks impressive capabilities, but the accompanying
computational and memory demands make both training and deployment expensive.
Existing efficiency efforts typically target either parameter sharing or
adaptive computation, leaving open the question of how to attain both
simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework
that combines the two axes of efficiency inside a single Recursive Transformer.
MoR reuses a shared stack of layers across recursion steps to achieve parameter
efficiency, while lightweight routers enable adaptive token-level thinking by
dynamically assigning different recursion depths to individual tokens. This
allows MoR to focus quadratic attention computation only among tokens still
active at a given recursion depth, further improving memory access efficiency
by selectively caching only their key-value pairs. Beyond these core
mechanisms, we also propose a KV sharing variant that reuses KV pairs from the
first recursion, specifically designed to decrease prefill latency and memory
footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms
a new Pareto frontier: at equal training FLOPs and smaller model sizes, it
significantly lowers validation perplexity and improves few-shot accuracy,
while delivering higher throughput compared with vanilla and existing recursive
baselines. These gains demonstrate that MoR is an effective path towards
large-model quality without incurring large-model cost.
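To make the core mechanism concrete, below is a minimal PyTorch sketch, not the authors' implementation: a single shared Transformer block is reused for up to max_depth recursion steps, and a lightweight linear router assigns each token its own recursion depth. Tokens that have exited are masked out of attention, which only approximates the paper's restriction of quadratic attention (and KV caching) to still-active tokens. All names here (SharedBlock, MoRSketch, max_depth) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """One Transformer block whose weights are reused at every recursion step."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, key_padding_mask=None):
        # Causal masking omitted for brevity.
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + a
        return x + self.mlp(self.norm2(x))

class MoRSketch(nn.Module):
    """Shared-weight recursion with a per-token depth router (illustrative)."""
    def __init__(self, d_model=256, n_heads=4, max_depth=3):
        super().__init__()
        self.block = SharedBlock(d_model, n_heads)   # parameter-sharing axis
        self.router = nn.Linear(d_model, max_depth)  # per-token depth logits
        self.max_depth = max_depth

    def forward(self, x):
        # Hard argmax stands in for the paper's trained routing schemes;
        # each token gets a depth in {1, ..., max_depth}.
        depth = self.router(x).argmax(dim=-1) + 1    # (batch, seq)
        for step in range(1, self.max_depth + 1):
            active = depth >= step                   # tokens still recursing
            if not active.any():
                break
            # Exited tokens are masked out of attention; the full method
            # instead gathers only active tokens (and caches only their KV
            # pairs), which is where the quadratic savings come from.
            out = self.block(x, key_padding_mask=~active)
            # Only active tokens take the update; exited tokens pass through.
            x = torch.where(active.unsqueeze(-1), out, x)
        return x
```

A quick smoke test: MoRSketch()(torch.randn(2, 8, 256)) returns a tensor of the same shape, with each token having been processed by the shared block between one and three times.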
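The KV sharing variant can likewise be gestured at in a few lines. The sketch below, under the simplifying assumptions of single-head attention, no causal masking, and no cache machinery, projects keys and values once from the first recursion's input and reuses them at every later step, which is what lets the variant cut prefill latency and KV-cache footprint. The function name and the weight matrices wq, wk, wv are hypothetical.

```python
import torch

def recursive_attend_with_shared_kv(x, wq, wk, wv, max_depth=3):
    """Single-head attention recursion that reuses first-recursion K/V."""
    # K and V are computed once from the first recursion's input and
    # shared across all subsequent recursion steps.
    k, v = x @ wk, x @ wv
    scale = k.shape[-1] ** 0.5
    for _ in range(max_depth):
        q = x @ wq                                    # queries refresh per step
        attn = torch.softmax(q @ k.transpose(-2, -1) / scale, dim=-1)
        x = x + attn @ v                              # residual update
    return x

# Example: x is (batch, seq, d); the weights are (d, d).
x = torch.randn(2, 8, 64)
wq, wk, wv = (torch.randn(64, 64) * 64 ** -0.5 for _ in range(3))
out = recursive_attend_with_shared_kv(x, wq, wk, wv)  # shape (2, 8, 64)
```

Because K/V projections and their cache entries are produced only at the first recursion, later steps pay only for the query projection and the attention itself, matching the abstract's stated goal of lower prefill cost and memory footprint.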