How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

April 27, 2026
Authors: Kristian Schwethelm, Daniel Rueckert, Georgios Kaissis
cs.AI

Abstract

We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts r ∈ {1, 2, 4, 8}, spanning ~50× in training compute, we fit a joint scaling law L = E + A·(N_once + r^φ N_rec)^{-α} + B·D^{-β} and recover a new recurrence-equivalence exponent φ = 0.46. Intuitively, φ tells us whether looping a block r times is equivalent in validation loss to r unique blocks of a non-looped model (full equivalence, φ = 1) or to a single block run repeatedly with no capacity gain (φ = 0). Our φ = 0.46 sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at r = 4 a 410M looped model performs on par with a 580M non-looped model, but incurs the training cost of a 1B non-looped one. We demonstrate the utility of φ as a measurement tool on two probes. Truncated backpropagation lowers φ to 0.38, indicating that the loop mechanism is poorly trained under truncation, even though validation loss decreases. Conversely, hyperconnections raise φ to 0.65, a genuine capacity gain. Our method applies to any looped LM and separates true loop improvements from token-budget gains.
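
The r = 4 example can be roughly reproduced from the capacity term N_once + r^φ N_rec. The sketch below is illustrative only and rests on assumptions not stated in the abstract: that the 410M figure is the unique parameter count N_once + N_rec, and that the 1B training-cost figure corresponds to the unrolled count N_once + r·N_rec. The function names and the resulting N_once/N_rec split are hypothetical.

```python
import math

PHI = 0.46  # recurrence-equivalence exponent reported in the abstract
R = 4       # recurrence count used in the abstract's example


def effective_params(n_once: float, n_rec: float, r: int, phi: float) -> float:
    """Capacity term of the scaling law: N_once + r^phi * N_rec."""
    return n_once + (r ** phi) * n_rec


def unrolled_params(n_once: float, n_rec: float, r: int) -> float:
    """Parameter count of a non-looped model with r unique copies of the block.

    Assumption: this is used here as a proxy for training cost.
    """
    return n_once + r * n_rec


# Assumed interpretation of the abstract's figures (not given in the source):
unique_total = 410e6    # N_once + N_rec
unrolled_total = 1.0e9  # N_once + R * N_rec

# Solve the two linear equations for the hypothetical split.
n_rec = (unrolled_total - unique_total) / (R - 1)
n_once = unique_total - n_rec

n_eff = effective_params(n_once, n_rec, R, PHI)

print(f"N_once = {n_once / 1e6:.1f}M, N_rec = {n_rec / 1e6:.1f}M")
print(f"effective parameters at r={R}: {n_eff / 1e6:.1f}M")
```

Under these assumptions the effective parameter count comes out at about 585M, close to the 580M quoted above. The two limiting cases also fall out directly: φ = 1 recovers the unrolled count, and φ = 0 recovers the unique count.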