Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
July 14, 2025
Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
cs.AI
Abstract
Scaling language models unlocks impressive capabilities, but the accompanying
computational and memory demands make both training and deployment expensive.
Existing efficiency efforts typically target either parameter sharing or
adaptive computation, leaving open the question of how to attain both
simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework
that combines the two axes of efficiency inside a single Recursive Transformer.
MoR reuses a shared stack of layers across recursion steps to achieve parameter
efficiency, while lightweight routers enable adaptive token-level thinking by
dynamically assigning different recursion depths to individual tokens. This
allows MoR to focus quadratic attention computation only among tokens still
active at a given recursion depth, further improving memory access efficiency
by selectively caching only their key-value pairs. Beyond these core
mechanisms, we also propose a KV sharing variant that reuses KV pairs from the
first recursion, specifically designed to decrease prefill latency and memory
footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms
a new Pareto frontier: at equal training FLOPs and smaller model sizes, it
significantly lowers validation perplexity and improves few-shot accuracy,
while delivering higher throughput compared with vanilla and existing recursive
baselines. These gains demonstrate that MoR is an effective path towards
large-model quality without incurring large-model cost.
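The following is a minimal, illustrative sketch (not the authors' implementation) of the core MoR idea described above: a single shared Transformer block is reused across recursion steps, and a lightweight router assigns each token its own recursion depth. Class and parameter names such as `SharedBlock`, `MixtureOfRecursionsSketch`, `router`, and `max_recursions` are assumptions for illustration; the sketch also omits MoR's selective attention over active tokens, recursion-wise KV caching, and differentiable router training.

```python
# Hypothetical sketch of recursion-wise parameter sharing with token-level
# routing, in the spirit of Mixture-of-Recursions. Not the paper's code.
import torch
import torch.nn as nn


class SharedBlock(nn.Module):
    """One Transformer-style block whose weights are reused at every recursion step."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.ffn(self.norm2(x))


class MixtureOfRecursionsSketch(nn.Module):
    """Applies the shared block up to `max_recursions` times; a linear router
    scores each token once, and the argmax of its logits fixes that token's
    recursion depth (a simplification of the paper's routing schemes)."""
    def __init__(self, d_model: int = 64, max_recursions: int = 3):
        super().__init__()
        self.block = SharedBlock(d_model)
        self.router = nn.Linear(d_model, max_recursions)  # one logit per depth
        self.max_recursions = max_recursions

    def forward(self, x):  # x: (batch, seq, d_model)
        depth = self.router(x).argmax(dim=-1) + 1          # per-token depth in [1, R]
        out = x
        for r in range(1, self.max_recursions + 1):
            updated = self.block(out)                       # shared weights at every step
            active = (depth >= r).unsqueeze(-1)             # tokens still recursing at step r
            out = torch.where(active, updated, out)         # exited tokens pass through unchanged
        return out


# Usage: route a toy batch through the sketch.
x = torch.randn(2, 8, 64)
y = MixtureOfRecursionsSketch()(x)
print(y.shape)  # torch.Size([2, 8, 64])
```

In the full method, tokens that have exited a recursion step would also be excluded from that step's quadratic attention and from its KV cache, which is where the memory and throughput savings reported in the abstract come from; the sketch above only captures the shared-weights recursion and per-token depth assignment.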