
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

April 10, 2024
Authors: Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal
cs.AI

Abstract

This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.
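To make the mechanism concrete, below is a minimal NumPy sketch of the idea described in the abstract: each segment is processed with masked local dot-product attention, a long-term read is retrieved from a fixed-size compressive memory via linear attention, the two outputs are mixed by a gate, and the memory is then updated with the segment's keys and values. The shapes, the ELU+1 feature map, the scalar gate beta, and the function name infini_attention_segment are illustrative assumptions, not the paper's reference implementation.

# Minimal NumPy sketch of the Infini-attention idea from the abstract:
# masked local attention plus a linear-attention read from a compressive
# memory that is carried and updated across segments of a long input.
# All shapes and the gating scheme below are assumptions for illustration.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def elu_plus_one(x):
    # Nonnegative feature map commonly used for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def infini_attention_segment(q, k, v, memory, z, beta):
    """Process one segment of length L with head dimension d.

    q, k, v : (L, d) projections for the current segment
    memory  : (d, d) compressive memory carried across segments
    z       : (d,)   normalization term carried across segments
    beta    : scalar gate mixing memory-based and local outputs
    """
    L, d = q.shape

    # 1) Masked (causal) local dot-product attention within the segment.
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    local_out = softmax(scores) @ v                      # (L, d)

    # 2) Long-term linear-attention retrieval from the compressive memory.
    sq = elu_plus_one(q)                                 # (L, d)
    mem_out = (sq @ memory) / (sq @ z + 1e-6)[:, None]   # (L, d)

    # 3) Gated combination of memory-based and local context.
    g = 1.0 / (1.0 + np.exp(-beta))
    out = g * mem_out + (1.0 - g) * local_out

    # 4) Update the compressive memory with this segment's keys/values,
    #    keeping the memory footprint constant regardless of input length.
    sk = elu_plus_one(k)
    memory = memory + sk.T @ v                           # (d, d)
    z = z + sk.sum(axis=0)                               # (d,)
    return out, memory, z

# Usage: stream segments of a long input while the memory size stays fixed.
rng = np.random.default_rng(0)
d, L = 16, 8
memory, z, beta = np.zeros((d, d)), np.zeros(d), 0.0
for _ in range(4):                                       # four segments
    q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
    out, memory, z = infini_attention_segment(q, k, v, memory, z, beta)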
