

Writing in the Margins: Better Inference Pattern for Long Context Retrieval

August 27, 2024
作者: Melisa Russak, Umar Jamil, Christopher Bryant, Kiran Kamble, Axel Magnuson, Mateusz Russak, Waseem AlShikh
cs.AI

Abstract

In this paper, we introduce Writing in the Margins (WiM), a new inference pattern for Large Language Models designed to optimize the handling of long input sequences in retrieval-oriented tasks. This approach leverages the chunked prefill of the key-value cache to perform segment-wise inference, which enables efficient processing of extensive contexts along with the generation and classification of intermediate information ("margins") that guide the model towards specific tasks. This method increases computational overhead only marginally while significantly enhancing the performance of off-the-shelf models without the need for fine-tuning. Specifically, we observe that WiM provides an average enhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG) and more than a 30.0% increase in the F1-score for aggregation tasks (CWE). Additionally, we show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing, and pinpoints how relevant information is integrated into the final response. We release our implementation of WiM using the Hugging Face Transformers library at https://github.com/writer/writing-in-the-margins.
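As a rough illustration of the pattern described above, the following minimal Python sketch shows the segment-wise loop: generate a "margin" note for each context chunk, classify whether it is relevant to the query, and answer from the accumulated relevant margins. The model name and prompt wording are placeholders rather than the authors' exact setup, and the chunked KV-cache prefill that gives WiM its efficiency is omitted; see the linked repository for the released implementation.

```python
# Minimal, illustrative sketch of the segment-wise "margin" loop, assuming a
# generic instruction-tuned causal LM from Hugging Face Transformers.
# Model name and prompt wording are placeholders, not the authors' exact setup;
# the chunked KV-cache prefill that makes WiM efficient is omitted here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any instruct model works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Greedy-decode a completion for a plain-text prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

def writing_in_the_margins(context_chunks: list[str], query: str) -> str:
    margins = []
    for chunk in context_chunks:
        # 1. Generate an intermediate note ("margin") for this segment.
        margin = generate(
            f"Context segment:\n{chunk}\n\nQuestion: {query}\n"
            "Write a short note covering anything in this segment that "
            "helps answer the question:"
        )
        # 2. Classify the margin: keep it only if it is relevant to the query.
        verdict = generate(
            f"Question: {query}\nNote: {margin}\n"
            "Is this note relevant to answering the question? Answer YES or NO:",
            max_new_tokens=3,
        )
        if verdict.strip().upper().startswith("YES"):
            margins.append(margin)
    # 3. Answer from the accumulated relevant margins.
    notes = "\n".join(f"- {m}" for m in margins)
    return generate(f"Relevant notes:\n{notes}\n\nQuestion: {query}\nAnswer:")
```

Because margins are produced per segment, they can also be streamed to the end-user as the context is processed, which is how the pattern supports the interactive retrieval design mentioned in the abstract.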
