
Stronger Normalization-Free Transformers

December 11, 2025
Authors: Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu
cs.AI

Abstract

Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce Derf(x) = erf(αx + s), where erf(x) is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
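
To make the idea concrete, below is a minimal PyTorch sketch of how a Derf layer could replace LayerNorm/RMSNorm as a point-wise operation. The per-channel parameterization of α and s, and the trailing affine (γ, β), are assumptions borrowed from the DyT convention; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn


class Derf(nn.Module):
    """Point-wise Derf layer: erf(alpha * x + s), used in place of a normalization layer.

    Assumptions (not confirmed by the abstract): alpha and s are learnable per-channel
    parameters, and an output affine (gamma, beta) follows, mirroring DyT.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # input scale (assumed per-channel)
        self.s = nn.Parameter(torch.zeros(dim))      # input shift (assumed per-channel)
        self.gamma = nn.Parameter(torch.ones(dim))   # output scale, as in DyT
        self.beta = nn.Parameter(torch.zeros(dim))   # output shift, as in DyT

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # erf is bounded in (-1, 1), so extreme activations are squashed point-wise,
        # without computing per-token statistics as LayerNorm/RMSNorm do.
        return self.gamma * torch.erf(self.alpha * x + self.s) + self.beta


# Example usage: swap the normalization layer in a Transformer block.
x = torch.randn(4, 128, 768)   # (batch, tokens, channels)
layer = Derf(dim=768)
y = layer(x)                   # same shape as x
```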