Writing in the Margins: Better Inference Pattern for Long Context Retrieval
August 27, 2024
Authors: Melisa Russak, Umar Jamil, Christopher Bryant, Kiran Kamble, Axel Magnuson, Mateusz Russak, Waseem AlShikh
cs.AI
Abstract
In this paper, we introduce Writing in the Margins (WiM), a new inference
pattern for Large Language Models designed to optimize the handling of long
input sequences in retrieval-oriented tasks. This approach leverages the
chunked prefill of the key-value cache to perform segment-wise inference, which
enables efficient processing of extensive contexts along with the generation
and classification of intermediate information ("margins") that guide the model
towards specific tasks. This method increases computational overhead marginally
while significantly enhancing the performance of off-the-shelf models without
the need for fine-tuning. Specifically, we observe that WiM provides an average
enhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG)
and more than a 30.0% increase in the F1-score for aggregation tasks (CWE).
Additionally, we show how the proposed pattern fits into an interactive
retrieval design that provides end-users with ongoing updates about the
progress of context processing, and pinpoints the integration of relevant
information into the final response. We release our implementation of WiM using
the Hugging Face Transformers library at
https://github.com/writer/writing-in-the-margins.
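To make the described pattern concrete, the sketch below illustrates segment-wise inference with chunked prefill of the key-value cache and per-segment "margin" notes, using Hugging Face Transformers. It is a minimal sketch under our own assumptions: the model name, prompts, chunking, and the relevance check are placeholders, not the authors' released implementation (see the repository above for the official code).

```python
# Minimal sketch of margin-style segment-wise inference (illustrative only).
# Assumptions: a single-sequence, no-padding setting; the model name, prompts,
# and the relevance heuristic below are placeholders, not the WiM release.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16
).to(device).eval()


@torch.no_grad()
def greedy_decode(prompt_ids, past, max_new_tokens=64):
    """Greedy decoding on top of an existing KV cache, with the forward loop
    kept explicit so the cache handling stays visible."""
    out = model(prompt_ids, past_key_values=past, use_cache=True)
    new_tokens = []
    for _ in range(max_new_tokens):
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        new_tokens.append(next_id.item())
        out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
    return tok.decode(new_tokens, skip_special_tokens=True)


@torch.no_grad()
def answer_with_margins(segments, query):
    """Chunked prefill of the long context; after each chunk, branch off a copy
    of the cache to write a short margin, keep relevant margins, then answer."""
    past = None
    margins = []
    for segment in segments:
        # 1) Chunked prefill: extend the shared KV cache with the next segment.
        seg_ids = tok(segment, return_tensors="pt").input_ids.to(device)
        past = model(seg_ids, past_key_values=past, use_cache=True).past_key_values

        # 2) Generate a margin note on a *copy* of the cache, so margin text
        #    never pollutes the context cache used for later segments.
        margin_prompt = tok(
            f"\nNote anything above that helps answer: {query}\n",
            return_tensors="pt",
        ).input_ids.to(device)
        margin = greedy_decode(margin_prompt, copy.deepcopy(past))

        # 3) Classify the margin; this string check is a stand-in for the
        #    paper's relevance classification step.
        if "irrelevant" not in margin.lower():
            margins.append(margin.strip())

    # 4) Final answer: feed the retained margins plus the query on top of the
    #    fully prefilled context cache.
    final_prompt = tok(
        "\nNotes:\n" + "\n".join(margins) + f"\nQuestion: {query}\nAnswer:",
        return_tensors="pt",
    ).input_ids.to(device)
    return greedy_decode(final_prompt, past, max_new_tokens=128)
```

The key design point the sketch tries to convey is that the margin generation branches off a copy of the prefilled cache, so intermediate notes can be produced and filtered after every chunk (and surfaced to the user in an interactive setting) without re-encoding the context or contaminating the cache used for the final answer.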