Breaking the Attention Bottleneck
June 16, 2024
Author: Kalle Hilsenbek
cs.AI
Abstract
Attention-based transformers have become the standard architecture in many deep learning fields, primarily due to their ability to model long-range dependencies and handle variable-length input sequences. However, the attention mechanism, with its quadratic complexity, is a significant bottleneck in the transformer architecture. In the decoder the algorithm is only uni-directional, and in over-parametrized decoder-only models it converges to a static pattern. I address this issue by developing a generative function that replaces attention or the activation. It retains the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT, this yields a smaller loss while using a smaller model. The loss drops further when an average context vector is incorporated. This concept of attention replacement is distributed under the GNU AGPL v3 license at https://gitlab.com/Bachstelze/causal_generation.
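
The abstract only sketches the mechanism; a minimal PyTorch illustration of the idea, assuming a nanoGPT-style module and hypothetical names such as CausalGeneration, pair_proj, and ctx_proj (the actual implementation is in the linked repository), could look as follows:

import torch
import torch.nn as nn

class CausalGeneration(nn.Module):
    """Sketch of the described attention replacement: each token is compared
    with its predecessor, and an averaged causal context vector is mixed in.
    Names and layer choices here are illustrative assumptions."""

    def __init__(self, n_embd: int):
        super().__init__()
        # combines a token with the previous token (the auto-regressive comparison)
        self.pair_proj = nn.Linear(2 * n_embd, n_embd)
        # mixes the pairwise result with the running average context vector
        self.ctx_proj = nn.Linear(2 * n_embd, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_embd)
        # shift the sequence right so position t sees token t-1 (zeros at t=0)
        prev = torch.cat([torch.zeros_like(x[:, :1, :]), x[:, :-1, :]], dim=1)
        pair = self.pair_proj(torch.cat([x, prev], dim=-1))

        # causal average context: mean of all tokens up to and including position t
        counts = torch.arange(1, x.size(1) + 1, device=x.device, dtype=x.dtype).view(1, -1, 1)
        avg_ctx = torch.cumsum(x, dim=1) / counts

        return self.ctx_proj(torch.cat([pair, avg_ctx], dim=-1))

# hypothetical drop-in usage in a nanoGPT-style block, replacing the attention module:
# self.attn = CausalGeneration(config.n_embd)

Unlike self-attention, this per-token comparison with a cumulative average runs in linear time over the sequence length, which is the sense in which the quadratic bottleneck is avoided.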