What Is the Value of Looped Language Models? An Iso-Depth Scaling-Law Investigation

Abstract

We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts r ∈ {1, 2, 4, 8}, spanning roughly 50× in training compute, we fit a joint scaling law L = E + A·(N_once + r^φ N_rec)^{-α} + B·D^{-β} and recover a new recurrence-equivalence exponent φ = 0.46. Intuitively, φ tells us whether looping a block r times is equivalent in validation loss to r unique blocks of a non-looped model (full equivalence, φ = 1) or to a single block run repeatedly with no capacity gain (φ = 0). Our φ = 0.46 sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at r = 4 a 410M looped model performs on par with a 580M non-looped model, but incurs the training cost of a 1B non-looped one. We demonstrate the utility of φ as a measurement tool on two probes. Truncated backpropagation lowers φ to 0.38, indicating that the loop mechanism is poorly trained under truncation, even though validation loss decreases. Conversely, hyperconnections raise φ to 0.65, a genuine capacity gain. Our method applies to any looped LM and separates true loop improvements from token-budget gains.
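The r = 4 example can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions the abstract does not state: that N_once counts parameters applied once, that N_rec counts parameters in the looped block, that the two sum to the quoted 410M, and that training cost scales with the unrolled count N_once + r·N_rec. The function names are hypothetical, and the N_once/N_rec split is back-solved from the quoted figures, not reported.

```python
PHI = 0.46  # recurrence-equivalence exponent quoted in the abstract


def effective_params(n_once, n_rec, r, phi=PHI):
    """Equivalent unique parameters: N_once + r^phi * N_rec."""
    return n_once + (r ** phi) * n_rec


def unrolled_params(n_once, n_rec, r):
    """Parameter count of a non-looped model of the same depth.

    Used here as a proxy for training cost (an assumption of this sketch).
    """
    return n_once + r * n_rec


def split_from_example(n_total, n_eff, r, phi=PHI):
    """Back out (N_once, N_rec) from two constraints:

    N_once + N_rec = n_total and N_once + r^phi * N_rec = n_eff.
    """
    n_rec = (n_eff - n_total) / (r ** phi - 1.0)
    return n_total - n_rec, n_rec


# Hypothetical split implied by "410M looped ~ 580M non-looped at r = 4".
n_once, n_rec = split_from_example(410e6, 580e6, r=4)

print(f"N_once ~ {n_once / 1e6:.0f}M, N_rec ~ {n_rec / 1e6:.0f}M")
print(f"effective params ~ {effective_params(n_once, n_rec, 4) / 1e6:.0f}M")
print(f"unrolled params  ~ {unrolled_params(n_once, n_rec, 4) / 1e6:.0f}M")
```

Under these assumptions the unrolled count comes out close to the 1B quoted above, so the three figures in the example are mutually consistent with φ = 0.46. Setting phi to 1 makes effective_params equal unrolled_params (full equivalence), and setting it to 0 gives N_once + N_rec (no capacity gain from looping).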