
Stronger Normalization-Free Transformers

December 11, 2025
Authors: Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
cs.AI

Abstract

Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce Derf(x) = erf(αx + s), where erf(x) is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
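To make the formula concrete, here is a minimal sketch of how Derf(x) = erf(αx + s) could be used as a point-wise, normalization-free drop-in for LayerNorm or RMSNorm in a Transformer block. Only the erf formula comes from the abstract; the learnable scalar α and shift s initializations, and the per-channel affine output (weight, bias) mirroring DyT's design, are assumptions for illustration rather than the paper's confirmed configuration.

```python
import torch
import torch.nn as nn


class Derf(nn.Module):
    """Point-wise Derf(x) = erf(alpha * x + s), sketched as a drop-in
    replacement for a normalization layer in a Transformer block.

    Note: a minimal sketch based on the abstract's formula only; the
    elementwise affine output (weight, bias) follows DyT's design and
    is an assumption, not confirmed by the abstract.
    """

    def __init__(self, dim: int, alpha_init: float = 1.0, shift_init: float = 0.0):
        super().__init__()
        # Learnable slope and shift inside erf (scalar here; per-channel is also plausible).
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        self.shift = nn.Parameter(torch.full((1,), shift_init))
        # Assumed per-channel affine, analogous to DyT / LayerNorm's gain and bias.
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Bounded point-wise transform; no statistics over tokens or channels are computed.
        x = torch.erf(self.alpha * x + self.shift)
        return x * self.weight + self.bias


if __name__ == "__main__":
    layer = Derf(dim=768)
    out = layer(torch.randn(2, 16, 768))  # (batch, sequence, channels)
    print(out.shape)  # torch.Size([2, 16, 768])
```

Like DyT, the transform is bounded (erf saturates at ±1), which is what constrains extreme activations without computing any per-token statistics; the hypothetical shift parameter s simply moves the saturation region off-center.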