Breaking the Attention Bottleneck
June 16, 2024
Author: Kalle Hilsenbek
cs.AI
Abstract
Attention-based transformers have become the standard architecture in many
deep learning fields, primarily due to their ability to model long-range
dependencies and handle variable-length input sequences. However, the attention
mechanism, with its quadratic complexity, is a significant bottleneck in the
transformer architecture. The attention algorithm is only uni-directional in the
decoder and converges to a static pattern in over-parametrized decoder-only
models. I address this issue by developing a generative function as a
replacement for attention or activation. It retains the auto-regressive
character by comparing each token with the previous one. In my test setting
with nanoGPT, this yields a smaller loss while using a smaller model. The loss
drops further by incorporating an average context vector. This concept of
attention replacement is distributed under the GNU AGPL v3 license at
https://gitlab.com/Bachstelze/causal_generation.
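
The described mechanism lends itself to a short sketch. Below is a minimal, hypothetical PyTorch module illustrating one way such a replacement could look: each token is compared with its predecessor (preserving the auto-regressive character) and mixed with a causal running-average context vector. The class name, parameter names, and the exact mixing operations are assumptions for illustration; the actual formulation is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalGenerationSketch(nn.Module):
    """Hypothetical sketch of the attention replacement described in the
    abstract: compare each token with the previous one and mix in a causal
    average context vector. Names and details are assumptions, not the
    repository's actual implementation."""

    def __init__(self, n_embd: int):
        super().__init__()
        self.compare = nn.Linear(2 * n_embd, n_embd)  # fuse token with its predecessor
        self.mix = nn.Linear(2 * n_embd, n_embd)      # fuse result with the average context

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_embd)
        # shift the sequence right by one position; the first token sees a zero vector
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
        pairwise = self.compare(torch.cat([x, prev], dim=-1))
        # causal running mean over the positions seen so far (auto-regressive context)
        counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        avg_ctx = torch.cumsum(x, dim=1) / counts
        return self.mix(torch.cat([pairwise, avg_ctx], dim=-1))

# toy usage: a batch of 2 sequences, 8 tokens, 32-dimensional embeddings
block = CausalGenerationSketch(32)
out = block(torch.randn(2, 8, 32))
print(out.shape)  # torch.Size([2, 8, 32])
```

In contrast to attention, both the pairwise comparison and the cumulative average run in linear time over the sequence length, which is the point of replacing the quadratic mechanism.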