Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
November 17, 2023
Authors: Vukasin Bozic, Danilo Dordevic, Daniele Coppola, Joseph Thommes
cs.AI
Abstract
This work presents an analysis of the effectiveness of using standard shallow
feed-forward networks to mimic the behavior of the attention mechanism in the
original Transformer model, a state-of-the-art architecture for
sequence-to-sequence tasks. We substitute key elements of the attention
mechanism in the Transformer with simple feed-forward networks, trained using
the original components via knowledge distillation. Our experiments, conducted
on the IWSLT2017 dataset, reveal the capacity of these "attentionless
Transformers" to rival the performance of the original architecture. Through
rigorous ablation studies, and experimenting with various replacement network
types and sizes, we offer insights that support the viability of our approach.
This not only sheds light on the adaptability of shallow feed-forward networks
in emulating attention mechanisms but also underscores their potential to
streamline complex architectures for sequence-to-sequence tasks.
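
The abstract describes the approach only at a high level. The short PyTorch sketch below illustrates the general idea of distilling an attention block into a shallow feed-forward student; the layer sizes, the fixed maximum sequence length, the flattening strategy, and the plain MSE distillation objective are all illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch: train a shallow feed-forward network, via knowledge
# distillation, to reproduce the output of a (frozen) self-attention block.
# All shapes and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn

d_model, max_len = 64, 32  # assumed embedding size and fixed sequence length

# Teacher: a standard self-attention block, standing in for the original layer.
teacher_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
teacher_attn.eval()  # frozen teacher, as in knowledge distillation

# Student: a shallow feed-forward replacement that sees the whole (flattened)
# sequence and predicts the attention output for every position at once.
student_ffn = nn.Sequential(
    nn.Flatten(start_dim=1),              # (batch, max_len, d_model) -> (batch, max_len*d_model)
    nn.Linear(max_len * d_model, 1024),
    nn.ReLU(),
    nn.Linear(1024, max_len * d_model),
    nn.Unflatten(1, (max_len, d_model)),  # back to (batch, max_len, d_model)
)

optimizer = torch.optim.Adam(student_ffn.parameters(), lr=1e-3)
mse = nn.MSELoss()

for step in range(200):                   # toy distillation loop on random activations
    x = torch.randn(16, max_len, d_model) # stand-in for intermediate activations
    with torch.no_grad():
        target, _ = teacher_attn(x, x, x) # teacher's attention output
    pred = student_ffn(x)
    loss = mse(pred, target)              # distillation objective: match the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, such a student network would be swapped into the layer in place of the attention module, which is the sense in which the resulting model is "attentionless".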