Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens

May 7, 2023
Authors: Zhanpeng Zeng, Cole Hawkins, Mingyi Hong, Aston Zhang, Nikolaos Pappas, Vikas Singh, Shuai Zheng
cs.AI

Abstract

Transformer models are foundational to natural language processing (NLP) and computer vision. Despite various recent works devoted to reducing the quadratic cost of such models (as a function of the sequence length n), dealing with ultra-long sequences efficiently (e.g., with more than 16K tokens) remains challenging. Applications such as answering questions based on an entire book or summarizing a scientific article are inefficient or infeasible. In this paper, we propose to significantly reduce the dependency of a Transformer model's complexity on n by compressing the input into a representation whose size r is independent of n at each layer. Specifically, by exploiting the fact that in many tasks only a small subset of special tokens (which we call VIP-tokens) are most relevant to the final prediction, we propose a VIP-token centric compression (Vcc) scheme that selectively compresses the input sequence based on its impact on approximating the representation of these VIP-tokens. Compared with competitive baselines, the proposed algorithm is not only efficient (achieving more than a 3x efficiency improvement over baselines at 4K and 16K lengths) but also achieves competitive or better performance on a large number of tasks. Further, we show that our algorithm can be scaled to 128K tokens (or more) while consistently offering accuracy improvements.
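The abstract only sketches the idea at a high level. Purely as a rough, hypothetical illustration of the intuition (keep the VIP tokens exact and compress the rest according to how much they matter to the VIP tokens' representations), and not the paper's actual Vcc algorithm, the minimal NumPy sketch below reduces a length-n sequence of hidden states to a fixed number of rows. The function name vip_centric_compress, the single-head attention-score heuristic for ranking tokens, and the mean-pooled summary row are all assumptions made for this toy example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def vip_centric_compress(x, vip_idx, r):
    """Compress hidden states x of shape (n, d) to about r rows (assumes r > |vip_idx|).

    Hypothetical sketch: VIP tokens are kept exactly; the remaining tokens are
    ranked by how much attention the VIP tokens pay to them (a crude proxy for
    their impact on the VIP representations); the top-ranked ones are retained
    and everything else is mean-pooled into a single summary row.
    """
    n, d = x.shape
    vip_idx = np.asarray(vip_idx)
    rest_idx = np.setdiff1d(np.arange(n), vip_idx)

    # Attention of VIP queries over all tokens (single head, no learned projections).
    attn = softmax(x[vip_idx] @ x.T / np.sqrt(d), axis=-1)   # (|vip|, n)
    impact = attn[:, rest_idx].sum(axis=0)                   # (n - |vip|,)

    # Keep the most impactful non-VIP tokens within the budget r.
    budget = max(r - len(vip_idx) - 1, 0)
    keep = rest_idx[np.argsort(-impact)[:budget]]
    drop = np.setdiff1d(rest_idx, keep)

    kept = np.sort(np.concatenate([vip_idx, keep]))
    summary = x[drop].mean(axis=0, keepdims=True) if len(drop) else np.zeros((0, d))
    return np.concatenate([x[kept], summary], axis=0)

# Toy usage: a 4096-token sequence compressed to 64 rows; VIP tokens = first 8 positions.
rng = np.random.default_rng(0)
x = rng.standard_normal((4096, 64))
z = vip_centric_compress(x, vip_idx=np.arange(8), r=64)
print(z.shape)  # (64, 64) -- independent of the original length n = 4096
```

Whatever selection rule is actually used, the point the abstract makes carries over: each layer operates on a representation of size r, so the per-layer cost no longer scales with the original sequence length n.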