
Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens

May 7, 2023
Authors: Zhanpeng Zeng, Cole Hawkins, Mingyi Hong, Aston Zhang, Nikolaos Pappas, Vikas Singh, Shuai Zheng
cs.AI

Abstract

Transformer models are foundational to natural language processing (NLP) and computer vision. Despite various recent works devoted to reducing the quadratic cost of such models (as a function of the sequence length n), dealing with ultra-long sequences efficiently (e.g., with more than 16K tokens) remains challenging. Applications such as answering questions based on an entire book or summarizing a scientific article are inefficient or infeasible. In this paper, we propose to significantly reduce the dependency of a Transformer model's complexity on n, by compressing the input into a representation whose size r is independent of n at each layer. Specifically, by exploiting the fact that in many tasks, only a small subset of special tokens (we call VIP-tokens) are most relevant to the final prediction, we propose a VIP-token centric compression (Vcc) scheme which selectively compresses the input sequence based on their impact on approximating the representation of these VIP-tokens. Compared with competitive baselines, the proposed algorithm not only is efficient (achieving more than 3× efficiency improvement compared to baselines on 4K and 16K lengths), but also achieves competitive or better performance on a large number of tasks. Further, we show that our algorithm can be scaled to 128K tokens (or more) while consistently offering accuracy improvement.
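
To make the core idea concrete, below is a minimal, hypothetical sketch (in Python/NumPy) of VIP-token-centric compression. It is not the paper's actual Vcc implementation: the function name `vip_centric_compress`, the attention-style importance proxy, and the hard top-r selection are illustrative assumptions. The sketch only shows the abstract's stated mechanism, namely that non-VIP tokens are selectively compressed according to their impact on approximating the VIP-token representations, yielding a representation whose size is independent of n.

```python
# Illustrative sketch of VIP-token-centric compression (NOT the paper's
# actual Vcc algorithm): non-VIP tokens are scored by a proxy for their
# impact on the VIP tokens' representations (here, attention-style
# similarity), and only the top-r of them are kept, so the retained
# sequence length no longer grows with n.
import numpy as np

def vip_centric_compress(x, vip_idx, r):
    """Compress an (n, d) token matrix to (len(vip_idx) + r, d).

    x       : (n, d) token embeddings at one layer's input
    vip_idx : indices of VIP tokens (e.g., question or [CLS] tokens)
    r       : number of non-VIP tokens to retain, independent of n
    """
    n, d = x.shape
    vip = x[vip_idx]                          # (v, d) VIP tokens, always kept
    mask = np.ones(n, dtype=bool)
    mask[vip_idx] = False
    rest = x[mask]                            # (n - v, d) non-VIP tokens

    # Proxy importance: softmax attention weight of VIP queries over the rest.
    scores = vip @ rest.T / np.sqrt(d)        # (v, n - v)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    importance = weights.sum(axis=0)          # total weight each non-VIP token receives

    keep = np.argsort(-importance)[:r]        # r most influential non-VIP tokens
    return np.concatenate([vip, rest[keep]], axis=0)

# Example: a 16K-token sequence with 8 VIP tokens compressed to 8 + 256 rows.
x = np.random.randn(16384, 64)
out = vip_centric_compress(x, vip_idx=np.arange(8), r=256)
print(out.shape)  # (264, 64)
```

In the paper's formulation this compression is applied at every layer, so downstream attention and feed-forward cost depends on r rather than on the full sequence length n.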