You Do Not Fully Utilize Transformer's Representation Capacity

Abstract

In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
