

You Do Not Fully Utilize Transformer's Representation Capacity

February 13, 2025
Authors: Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
cs.AI

Abstract

In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
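The core idea described above, letting an attention layer build its keys and values from a learned mixture of earlier layers' hidden states rather than only from the immediately preceding layer, can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration under stated assumptions: the class name, the softmax mixing over layers, and the use of nn.MultiheadAttention are illustrative choices, not the authors' implementation. It only shows how a learned per-layer mixture over past representations could be wired in while still attending to a single memory tensor.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation):
# keys/values come from a learned softmax mixture of earlier layers' states,
# queries come from the current layer's input.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerIntegratedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_prev_layers: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One learned mixing weight per earlier representation
        # (hypothetical parameterization; the paper's lookup mechanism may differ).
        self.mix_logits = nn.Parameter(torch.zeros(n_prev_layers))

    def forward(self, x, past_states):
        # past_states: list of (B, T, d_model) hidden states from earlier layers.
        weights = F.softmax(self.mix_logits, dim=0)                # (L,)
        stacked = torch.stack(past_states, dim=0)                  # (L, B, T, D)
        memory = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)
        # Attend with current-layer queries over the mixed memory.
        out, _ = self.attn(x, memory, memory, need_weights=False)
        return out


# Tiny usage example (causal masking omitted for brevity).
if __name__ == "__main__":
    B, T, D, H, L = 2, 8, 64, 4, 3
    layer = LayerIntegratedAttention(D, H, n_prev_layers=L)
    past = [torch.randn(B, T, D) for _ in range(L)]
    y = layer(past[-1], past)
    print(y.shape)  # torch.Size([2, 8, 64])
```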

