You Do Not Fully Utilize Transformer's Representation Capacity

Abstract

In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
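To make the idea of "access to hidden states from earlier layers" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method as specified by the authors: the abstract does not describe the lookup mechanism, so the softmax-weighted mixture over layers, the names `mix_layers`, `causal_attention`, and `router_logits`, and the omission of query/key/value projections are all hypothetical choices made for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_states, router_logits):
    """Blend hidden states from several layers into one memory sequence.

    layer_states: list over layers, each shaped [seq_len][dim].
    router_logits: one scalar per layer (hypothetical learned parameters).
    """
    weights = softmax(router_logits)
    seq_len = len(layer_states[0])
    dim = len(layer_states[0][0])
    mixed = [[0.0] * dim for _ in range(seq_len)]
    for w, states in zip(weights, layer_states):
        for t in range(seq_len):
            for d in range(dim):
                mixed[t][d] += w * states[t][d]
    return mixed


def causal_attention(queries, memory):
    """Single-head causal attention.

    Assumption: `memory` serves as both keys and values, with no
    learned projections, to keep the sketch short.
    """
    dim = len(queries[0])
    out = []
    for t, q in enumerate(queries):
        # Position t may only look at positions 0..t.
        scores = [
            sum(a * b for a, b in zip(q, memory[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        probs = softmax(scores)
        out.append(
            [sum(p * memory[s][d] for s, p in enumerate(probs)) for d in range(dim)]
        )
    return out


# Toy example: two layers, two tokens, two dimensions.
layer0 = [[1.0, 0.0], [0.0, 1.0]]
layer1 = [[0.5, 0.5], [0.5, 0.5]]

# Equal logits give each layer the same weight in the mixture.
mixed = mix_layers([layer0, layer1], router_logits=[0.0, 0.0])

# Queries come from the current layer; keys/values come from the mixture.
output = causal_attention(queries=layer1, memory=mixed)
```

In a standard Transformer the memory passed to attention would be `layer1` alone. Here the earlier layer's states also contribute, and the only new parameters in this sketch are the per-layer logits.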
