TRAMS: Training-free Memory Selection for Long-range Language Modeling
October 24, 2023
Authors: Haofei Yu, Cunxiang Wang, Yue Zhang, Wei Bi
cs.AI
Abstract
The Transformer architecture is crucial for numerous AI models, but it still
faces challenges in long-range language modeling. Though several specific
Transformer architectures have been designed to tackle issues of long-range
dependencies, existing methods like Transformer-XL are plagued by a high
percentage of ineffective memories. In this study, we present a plug-and-play
strategy, known as TRAining-free Memory Selection (TRAMS), that selects tokens
participating in attention calculation based on one simple metric. This
strategy allows us to keep tokens that are likely to have a high attention
score with the current queries and ignore the other ones. We have tested our
approach on the word-level benchmark (WikiText-103) and the character-level
benchmark (enwik8), and the results indicate an improvement without
additional training or added parameters.
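As a rough illustration of the idea, the sketch below scores cached memory tokens with a simple query-independent metric and lets only the top-scoring subset participate in attention alongside the local context. The choice of metric (the L2 norm of the layer-normalized key), the helper names select_memory and attend_with_selected_memory, and the memory budget m are assumptions made for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def select_memory(mem_k, mem_v, m):
    # Score each cached memory token with a query-independent metric
    # (here: L2 norm of the layer-normalized key) and keep the top-m.
    normed = F.layer_norm(mem_k, mem_k.shape[-1:])
    metric = normed.norm(dim=-1)                      # (mem_len,)
    idx = metric.topk(min(m, mem_k.size(0))).indices  # indices of kept tokens
    return mem_k[idx], mem_v[idx]

def attend_with_selected_memory(q, local_k, local_v, mem_k, mem_v, m=64):
    # Scaled dot-product attention over the local context plus only the
    # selected memory tokens, rather than the full memory. Causal masking
    # over the local tokens is omitted for brevity.
    sel_k, sel_v = select_memory(mem_k, mem_v, m)
    k = torch.cat([sel_k, local_k], dim=0)            # (m + local_len, d)
    v = torch.cat([sel_v, local_v], dim=0)
    scores = q @ k.t() / k.size(-1) ** 0.5            # (q_len, m + local_len)
    return F.softmax(scores, dim=-1) @ v

Because the selection step only ranks and gathers existing key/value caches, it adds no trainable parameters and can be dropped into a pretrained memory-augmented model at inference time.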