

Parcae: Scaling Laws For Stable Looped Language Models

April 14, 2026
Authors: Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, Daniel Y. Fu
cs.AI

Abstract

Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization, at the expense of a higher memory footprint or greater data requirements. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear, time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability in existing looped architectures results from large spectral norms in their injection parameters. To address these instability issues, we propose Parcae, a novel stable looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization. As a result, Parcae achieves up to 6.3% lower validation perplexity than prior large-scale looped models. Using our stable looped architecture, we investigate the scaling properties of looping as a medium for improving quality by increasing FLOPs at training and test time. For training, we derive predictable power laws for scaling FLOPs while keeping parameter count fixed. Our initial scaling laws suggest that, under a fixed FLOP budget, looping and data should be increased in tandem. At test time, we find that Parcae can use looping to scale compute, following a predictable, saturating exponential decay. When scaled up to 1.3B parameters, Parcae improves CORE and Core-Extended quality by 2.99 and 1.18 points, respectively, over strong Transformer baselines under a fixed parameter and data budget, achieving up to 87.5% of the relative quality of a Transformer twice its size.
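The abstract does not spell out the negative diagonal parameterization; the following is a minimal PyTorch-style sketch of one way such a parameterization, once discretized, bounds the spectral norm of an injection below 1. All names here (e.g. `NegativeDiagonalInjection`, `log_a`, `log_dt`) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumption, not the authors' implementation): a diagonal
# injection gate parameterized so that, after discretization, every entry
# lies in (0, 1), which bounds the spectral norm of the injection below 1.
import torch
import torch.nn as nn


class NegativeDiagonalInjection(nn.Module):
    """Hypothetical injection gate: A = -exp(log_a) is a negative diagonal,
    and zero-order-hold-style discretization exp(A * dt) maps it into (0, 1)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(d_model))   # A = -exp(log_a) < 0
        self.log_dt = nn.Parameter(torch.zeros(d_model))  # step size dt > 0

    def forward(self, residual: torch.Tensor, injected: torch.Tensor) -> torch.Tensor:
        a = -torch.exp(self.log_a)        # strictly negative diagonal entries
        dt = torch.exp(self.log_dt)       # strictly positive step size
        gate = torch.exp(a * dt)          # entries in (0, 1) => spectral norm < 1
        # Convex-combination-style update: one plausible reading of how bounding
        # the injection keeps the residual stream from exploding across loops.
        return gate * residual + (1.0 - gate) * injected
```

Because every entry of `gate` is strictly between 0 and 1, repeated application across loop iterations cannot amplify the residual stream, which matches the intuition the abstract attributes to constraining the injection's spectral norm.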
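The abstract also reports that test-time quality scales with loop count following a saturating exponential decay. A hedged sketch of fitting such a curve is below; the functional form, variable names, and data are assumptions used only to illustrate the fit, not the paper's measurements.

```python
# Sketch (assumed functional form, synthetic data): fit a saturating exponential
# decay  loss(r) = L_inf + (L_0 - L_inf) * exp(-k * r)  to loss vs. loop count r.
import numpy as np
from scipy.optimize import curve_fit


def saturating_decay(r, L_inf, L_0, k):
    return L_inf + (L_0 - L_inf) * np.exp(-k * r)


rng = np.random.default_rng(0)
loops = np.array([1, 2, 4, 8, 16, 32], dtype=float)
# Synthetic losses drawn from the assumed curve plus noise (not the paper's numbers).
losses = saturating_decay(loops, 2.80, 3.20, 0.4) + rng.normal(0, 0.005, loops.shape)

(L_inf, L_0, k), _ = curve_fit(saturating_decay, loops, losses, p0=(2.5, 3.5, 0.1))
print(f"fitted asymptote {L_inf:.3f}, initial loss {L_0:.3f}, decay rate {k:.3f}")
```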