

Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers

November 17, 2023
Authors: Vukasin Bozic, Danilo Dordevic, Daniele Coppola, Joseph Thommes
cs.AI

Abstract

This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture. Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.
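The core idea is to replace an attention block with a shallow feed-forward network and train that network, via knowledge distillation, to reproduce the outputs of the original (frozen) attention layer. Below is a minimal illustrative sketch of this setup in PyTorch; the module name `FFAttentionReplacement`, the fixed sequence length, the hidden size, and the MSE distillation loss are assumptions for illustration and do not reflect the exact architectures or training procedure used by the authors.

```python
# Illustrative sketch (assumptions, not the authors' code): train a shallow
# feed-forward network to imitate a frozen attention layer's outputs.
import torch
import torch.nn as nn

class FFAttentionReplacement(nn.Module):
    """Shallow MLP mapping a flattened fixed-length sequence to an output
    with the same shape as the attention layer's output."""
    def __init__(self, seq_len: int, d_model: int, hidden: int = 1024):
        super().__init__()
        self.seq_len, self.d_model = seq_len, d_model
        self.net = nn.Sequential(
            nn.Linear(seq_len * d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, seq_len * d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten, transform, reshape back
        b = x.size(0)
        out = self.net(x.reshape(b, -1))
        return out.reshape(b, self.seq_len, self.d_model)

# Knowledge distillation: the frozen attention layer provides the targets.
seq_len, d_model = 32, 64
teacher = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
teacher.eval()
student = FFAttentionReplacement(seq_len, d_model)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):  # toy training loop on random data
    x = torch.randn(8, seq_len, d_model)
    with torch.no_grad():
        target, _ = teacher(x, x, x)   # teacher attention output
    pred = student(x)                  # student feed-forward output
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the paper's setting the distilled feed-forward replacement is then evaluated inside the full Transformer on IWSLT2017 translation; the toy loop above only illustrates the imitation objective on random tensors.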