Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

May 14, 2024
Authors: Xueyan Niu, Bo Bai, Lei Deng, Wei Han
cs.AI

Abstract

Increasing the size of a Transformer model does not always lead to enhanced performance. This phenomenon cannot be explained by the empirical scaling laws. Furthermore, improved generalization ability can emerge as the model memorizes the training samples. We present a theoretical framework that sheds light on the memorization process and performance dynamics of Transformer-based language models. We model the behavior of Transformers with associative memories using Hopfield networks, such that each Transformer block effectively conducts an approximate nearest-neighbor search. Based on this, we design an energy function analogous to that of the modern continuous Hopfield network, which provides an insightful explanation for the attention mechanism. Using the majorization-minimization technique, we construct a global energy function that captures the layered architecture of the Transformer. Under specific conditions, we show that the minimum achievable cross-entropy loss is bounded from below by a constant approximately equal to 1. We substantiate our theoretical results by conducting experiments with GPT-2 on various data sizes, as well as by training vanilla Transformers on a dataset of 2M tokens.
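
The correspondence the abstract invokes, that each Transformer block "effectively conducts an approximate nearest-neighbor search", rests on the known equivalence between one retrieval step of a modern continuous Hopfield network and a single-query attention read-out. The sketch below is not the paper's code; the function names (`hopfield_retrieve`, `attention_readout`), the toy dimensions, and the choice of inverse temperature `beta` are illustrative assumptions, used only to check the equivalence numerically.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hopfield_retrieve(X, xi, beta=1.0):
    """One update step of a modern continuous Hopfield network:
    xi_new = X softmax(beta * X^T xi), a soft nearest-neighbor lookup
    over the stored patterns (columns of X)."""
    return X @ softmax(beta * (X.T @ xi))

def attention_readout(q, K, V, beta=1.0):
    """Single-query dot-product attention with an explicit inverse
    temperature beta (scaled dot-product attention uses beta = 1/sqrt(d_k))."""
    return V.T @ softmax(beta * (K @ q))

rng = np.random.default_rng(0)
d, n = 8, 5                      # pattern dimension, number of stored patterns
X = rng.standard_normal((d, n))  # stored patterns, one per column
xi = rng.standard_normal(d)      # query / current state

# With keys = values = the stored patterns and the same beta,
# the Hopfield retrieval and the attention read-out coincide.
assert np.allclose(hopfield_retrieve(X, xi, beta=0.5),
                   attention_readout(xi, X.T, X.T, beta=0.5))
print(hopfield_retrieve(X, xi, beta=0.5))
```

Under this identification, setting beta = 1/sqrt(d_k) and letting the stored patterns play the role of keys and values recovers a single head of scaled dot-product attention; the paper's energy-function construction and the cross-entropy lower bound build on this view of each block as a retrieval step, and are not reproduced here.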
