Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers

Abstract

This work analyzes how effectively standard shallow feed-forward networks can mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We replace key elements of the Transformer's attention mechanism with simple feed-forward networks, which are trained via knowledge distillation using the original components. Our experiments on the IWSLT2017 dataset show that these "attentionless Transformers" can rival the performance of the original architecture. Through rigorous ablation studies and experiments with various replacement network types and sizes, we offer insights that support the viability of our approach. These results not only shed light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscore their potential to streamline complex architectures for sequence-to-sequence tasks.
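To make the idea concrete, the following is a minimal sketch of the setup the abstract describes: a self-attention layer acts as the teacher, and a shallow feed-forward network is scored on how closely it reproduces the teacher's output. The abstract gives no implementation details, so everything here is an assumption chosen for illustration: the single-head teacher, the fixed sequence length with a flattened input, the one-hidden-layer ReLU student, the mean-squared-error distillation loss, and all function names and sizes. Only the forward pass and the loss are shown, with no training loop.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Matrix of Gaussian noise (hypothetical initialisation)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def teacher_self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention: the teacher."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def student_ffn(x, w1, b1, w2, b2):
    """Assumed student: one hidden ReLU layer over the flattened sequence."""
    flat = [v for token in x for v in token]
    hidden = [max(0.0, h + b) for h, b in zip(matmul([flat], w1)[0], b1)]
    out = [o + b for o, b in zip(matmul([hidden], w2)[0], b2)]
    d = len(x[0])
    # Reshape the flat output back to (sequence length, model dimension).
    return [out[i:i + d] for i in range(0, len(out), d)]


def distillation_mse(student_out, teacher_out):
    """Mean squared error between student and teacher outputs (assumed loss)."""
    diffs = [(s - t) ** 2
             for s_row, t_row in zip(student_out, teacher_out)
             for s, t in zip(s_row, t_row)]
    return sum(diffs) / len(diffs)


# Illustrative sizes only; they are not taken from the paper.
rng = random.Random(0)
seq_len, d_model, hidden = 4, 3, 16
x = rand_matrix(seq_len, d_model, rng, scale=1.0)
wq, wk, wv = (rand_matrix(d_model, d_model, rng) for _ in range(3))
w1 = rand_matrix(seq_len * d_model, hidden, rng)
b1 = [0.0] * hidden
w2 = rand_matrix(hidden, seq_len * d_model, rng)
b2 = [0.0] * (seq_len * d_model)

target = teacher_self_attention(x, wq, wk, wv)
pred = student_ffn(x, w1, b1, w2, b2)
loss = distillation_mse(pred, target)
```

In a full training run, the student's weights would be updated to reduce `loss` over many input sequences, after which the student would stand in for the attention layer. Because the student's input size is fixed in this sketch, shorter sequences would need padding to `seq_len`.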
