

Overflow Prevention Enhances Long-Context Recurrent LLMs

May 12, 2025
Authors: Assaf Ben-Kish, Itamar Zimerman, M. Jehanzeb Mirza, James Glass, Leonid Karlinsky, Raja Giryes
cs.AI

Abstract

A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, they still underutilize the long context. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input, can mitigate recurrent memory failures and is effective for many long-context tasks: on LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also achieves state-of-the-art results on the challenging LongBench v2 benchmark, showing performance competitive with equivalently sized Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance even in tasks that presumably require cross-context relations.
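
As a rough illustration of the idea behind such a chunk-based procedure, the sketch below splits a long input into chunks, scores each chunk's relevance to the query, and generates an answer from only the best-scoring chunk, so the recurrent model's fixed-size memory is never asked to absorb the full context. All names here (split_into_chunks, lexical_relevance, answer_from_best_chunk, generate_fn) and the toy token-overlap scorer are illustrative assumptions, not the paper's actual implementation, which may select and score chunks differently.

```python
# Minimal sketch of single-chunk inference for a long-context query.
# Hypothetical helper names; not the paper's API.
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int = 2048) -> List[str]:
    """Split a long context into fixed-size chunks (character-based for simplicity)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def lexical_relevance(chunk: str, query: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the chunk.
    A real system might instead score chunks with the LM itself."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    chunk_tokens = set(chunk.lower().split())
    return len(query_tokens & chunk_tokens) / len(query_tokens)


def answer_from_best_chunk(
    context: str,
    query: str,
    generate_fn: Callable[[str], str],
    chunk_size: int = 2048,
) -> str:
    """Keep only the most relevant chunk and run generation on it,
    preventing the recurrent memory from overflowing on the full input."""
    chunks = split_into_chunks(context, chunk_size)
    best_chunk = max(chunks, key=lambda c: lexical_relevance(c, query))
    prompt = f"{best_chunk}\n\nQuestion: {query}\nAnswer:"
    return generate_fn(prompt)
```

In practice, generate_fn would wrap whichever recurrent LLM is being evaluated (e.g., a Mamba or RWKV model), and the overlap-based scorer could be replaced by a likelihood-based criterion computed with the same model.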
