
LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models

October 12, 2024
Authors: Zihan Zhou, Chong Li, Xinyi Chen, Shuo Wang, Yu Chao, Zhili Li, Haoyu Wang, Rongqiao An, Qi Shi, Zhixing Tan, Xu Han, Xiaodong Shi, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

Enlarging the context window of large language models (LLMs) has become a crucial research area, particularly for applications involving extremely long texts. In this work, we propose a novel training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding. The proposed LLM×MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output. The main challenge for divide-and-conquer long text processing frameworks lies in the risk of losing essential long-range information when splitting the document, which can lead the model to produce incomplete or incorrect answers based on the segmented texts. Disrupted long-range information can be classified into two categories: inter-chunk dependency and inter-chunk conflict. We design a structured information protocol to better cope with inter-chunk dependency and an in-context confidence calibration mechanism to resolve inter-chunk conflicts. Experimental results demonstrate that LLM×MapReduce can outperform representative open-source and commercial long-context LLMs, and is applicable to several different models.
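To make the divide-and-conquer flow concrete, below is a minimal Python sketch of a map-reduce style pipeline for long-document question answering. It illustrates the general technique rather than the paper's implementation: the `call_llm` helper, the chunking parameters, and the structured fields (`answer`, `evidence`, `confidence`) are all assumptions standing in for the framework's structured information protocol and in-context confidence calibration.

```python
import json
from dataclasses import dataclass

CHUNK_CHARS = 4000  # assumed chunk size; tune to the backbone model's window


@dataclass
class ChunkAnswer:
    """Structured intermediate result. This field set is an illustrative
    assumption, standing in for the paper's structured information protocol."""
    answer: str        # chunk-level answer to the question
    evidence: str      # supporting text extracted from the chunk
    confidence: float  # self-reported confidence in [0, 1]


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion backend; not a real client."""
    raise NotImplementedError


def split_document(doc: str, size: int = CHUNK_CHARS) -> list[str]:
    """Divide: cut the document into fixed-size chunks."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]


def map_stage(chunk: str, question: str) -> ChunkAnswer:
    """Map: answer from a single chunk, emitting structured output so that
    chunk-local evidence survives aggregation."""
    raw = call_llm(
        "Answer the question using ONLY the passage below. Reply as JSON "
        "with keys 'answer', 'evidence', 'confidence' (a number in [0, 1]).\n"
        f"Question: {question}\nPassage: {chunk}"
    )
    data = json.loads(raw)
    return ChunkAnswer(data["answer"], data["evidence"], float(data["confidence"]))


def reduce_stage(results: list[ChunkAnswer], question: str) -> str:
    """Reduce: aggregate chunk-level answers. Confidence scores are placed
    in context so the model can down-weight low-confidence, conflicting
    chunks, a rough analogue of in-context confidence calibration."""
    summary = "\n".join(
        f"[conf={r.confidence:.2f}] answer: {r.answer} | evidence: {r.evidence}"
        for r in results
    )
    return call_llm(
        "Combine the chunk-level answers below into one final answer. Prefer "
        "answers backed by evidence and higher confidence, and resolve "
        f"conflicts explicitly.\nQuestion: {question}\n{summary}"
    )


def answer_long_document(doc: str, question: str) -> str:
    chunks = split_document(doc)
    intermediate = [map_stage(c, question) for c in chunks]
    return reduce_stage(intermediate, question)
```

In this sketch, conflicts between chunks are surfaced to the reduce-stage model together with each chunk's self-reported confidence, so the aggregator can prefer well-supported answers instead of averaging contradictory ones.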
