Breaking the Attention Bottleneck
June 16, 2024
Author: Kalle Hilsenbek
cs.AI
Abstract
Attention-based transformers have become the standard architecture in many deep learning fields, primarily due to their ability to model long-range dependencies and handle variable-length input sequences. However, the attention mechanism, with its quadratic complexity, is a significant bottleneck in the transformer architecture. In the decoder the algorithm is only uni-directional, and in over-parametrized decoder-only models it converges to a static pattern. I address this issue by developing a generative function that replaces attention or the activation. It retains the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT, this yields a smaller loss while using a smaller model. The loss drops further when an average context vector is incorporated. This concept of attention replacement is distributed under the GNU AGPL v3 license at https://gitlab.com/Bachstelze/causal_generation.
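
The abstract only sketches the mechanism; a minimal PyTorch illustration of the idea, assuming a nanoGPT-style module and hypothetical names such as CausalGeneration, pair_proj, and ctx_proj (the actual implementation is in the linked repository), could look as follows:

import torch
import torch.nn as nn

class CausalGeneration(nn.Module):
    """Sketch of the described attention replacement: each token is compared
    with its predecessor, and an averaged causal context vector is mixed in.
    Names and layer choices here are illustrative assumptions."""

    def __init__(self, n_embd: int):
        super().__init__()
        # combines a token with the previous token (the auto-regressive comparison)
        self.pair_proj = nn.Linear(2 * n_embd, n_embd)
        # mixes the pairwise result with the running average context vector
        self.ctx_proj = nn.Linear(2 * n_embd, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_embd)
        # shift the sequence right so position t sees token t-1 (zeros at t=0)
        prev = torch.cat([torch.zeros_like(x[:, :1, :]), x[:, :-1, :]], dim=1)
        pair = self.pair_proj(torch.cat([x, prev], dim=-1))

        # causal average context: mean of all tokens up to and including position t
        counts = torch.arange(1, x.size(1) + 1, device=x.device, dtype=x.dtype).view(1, -1, 1)
        avg_ctx = torch.cumsum(x, dim=1) / counts

        return self.ctx_proj(torch.cat([pair, avg_ctx], dim=-1))

# hypothetical drop-in usage in a nanoGPT-style block, replacing the attention module:
# self.attn = CausalGeneration(config.n_embd)

Unlike self-attention, this per-token comparison with a cumulative average runs in linear time over the sequence length, which is the sense in which the quadratic bottleneck is avoided.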