Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models

November 6, 2024
Authors: Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma
cs.AI

Abstract

Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the optimal approximation rate, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at https://github.com/BryceZhuo/PolyCom.
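The abstract does not spell out PolyCom's functional form. As a rough illustration only, the sketch below assumes one possible reading: a low-degree polynomial applied to the output of ReLU, with one coefficient per power. The function name `poly_relu`, the degree, and the coefficient values are hypothetical choices for this example, not taken from the abstract or the linked repository; in a real model the coefficients would presumably be learned.

```python
from typing import Sequence


def relu(x: float) -> float:
    """Standard ReLU: pass positive inputs through, clamp the rest to zero."""
    return x if x > 0.0 else 0.0


def poly_relu(x: float, coeffs: Sequence[float]) -> float:
    """Hypothetical polynomial-of-ReLU activation (an assumed form).

    Computes sum_i coeffs[i] * relu(x) ** i, where coeffs[0] is the
    constant term. Horner's rule avoids computing explicit powers.
    """
    base = relu(x)
    result = 0.0
    for a in reversed(coeffs):
        result = result * base + a
    return result


# Illustrative degree-3 coefficients (arbitrary values, not from the paper).
example_coeffs = [0.0, 1.0, 0.5, 0.25]

print(poly_relu(2.0, example_coeffs))   # 2 + 0.5 * 4 + 0.25 * 8 = 6.0
print(poly_relu(-1.0, example_coeffs))  # relu(-1) = 0, so only coeffs[0] remains: 0.0
```

With coefficients `[0.0, 1.0]` this sketch reduces to plain ReLU, which gives one way to see how the higher-degree terms could let a layer express higher-order interactions of its inputs.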