
Focused Transformer: Contrastive Training for Context Scaling

July 6, 2023
Authors: Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, Piotr Miłoś
cs.AI

Abstract

Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often constrained by a limited effective context length. One solution to this issue is to endow an attention layer with access to an external memory, which comprises (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context, which we demonstrate by fine-tuning 3B and 7B OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a 256k context length for passkey retrieval.
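To make the memory-augmented attention described above concrete, below is a minimal, illustrative sketch (not the authors' code) of an attention step that attends jointly over the local context and (key, value) pairs drawn from an external memory. All function and tensor names, shapes, and the plain PyTorch implementation are assumptions for illustration; in FoT-style training, the memory is additionally assumed to contain keys and values from unrelated documents, whose competition in the softmax is the "distraction issue" the abstract refers to.

```python
import torch

def memory_attention(q, local_k, local_v, mem_k, mem_v):
    """Attend over the local context plus an external (key, value) memory.

    Hypothetical shapes:
      q:        (batch, heads, q_len, d)  queries from the current context
      local_k:  (batch, heads, l_len, d)  keys from the local context
      local_v:  (batch, heads, l_len, d)  values from the local context
      mem_k:    (batch, heads, m_len, d)  keys retrieved from external memory
      mem_v:    (batch, heads, m_len, d)  values retrieved from external memory
    """
    d = q.size(-1)
    # Concatenate local and memory (key, value) pairs into one attention span.
    k = torch.cat([local_k, mem_k], dim=-2)
    v = torch.cat([local_v, mem_v], dim=-2)
    # Standard scaled dot-product attention over the combined span. When most
    # memory keys are irrelevant, softmax mass leaks onto them; contrastive-style
    # training aims to keep keys of different semantics well separated.
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    weights = scores.softmax(dim=-1)
    return weights @ v

# Hypothetical usage: 16 query tokens, 128 local tokens, 1024 memory tokens.
q = torch.randn(1, 4, 16, 32)
local_k, local_v = torch.randn(1, 4, 128, 32), torch.randn(1, 4, 128, 32)
mem_k, mem_v = torch.randn(1, 4, 1024, 32), torch.randn(1, 4, 1024, 32)
out = memory_attention(q, local_k, local_v, mem_k, mem_v)  # (1, 4, 16, 32)
```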