LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models
October 12, 2024
Authors: Zihan Zhou, Chong Li, Xinyi Chen, Shuo Wang, Yu Chao, Zhili Li, Haoyu Wang, Rongqiao An, Qi Shi, Zhixing Tan, Xu Han, Xiaodong Shi, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Enlarging the context window of large language models (LLMs) has become a
crucial research area, particularly for applications involving extremely long
texts. In this work, we propose a novel training-free framework for processing
long texts, utilizing a divide-and-conquer strategy to achieve comprehensive
document understanding. The proposed LLM×MapReduce framework splits the
entire document into several chunks for LLMs to read and then aggregates the
intermediate answers to produce the final output. The main challenge for
divide-and-conquer long text processing frameworks lies in the risk of losing
essential long-range information when splitting the document, which can lead
the model to produce incomplete or incorrect answers based on the segmented
texts. Disrupted long-range information can be classified into two categories:
inter-chunk dependency and inter-chunk conflict. We design a structured
information protocol to better cope with inter-chunk dependency and an
in-context confidence calibration mechanism to resolve inter-chunk conflicts.
Experimental results demonstrate that LLM×MapReduce can outperform
representative open-source and commercial long-context LLMs, and is applicable
to several different models.
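To make the described pipeline concrete, the following is a minimal Python sketch of the divide-and-conquer loop outlined in the abstract: the document is split into chunks, each chunk is answered independently with a self-reported confidence score (a simple stand-in for the paper's in-context confidence calibration), and the intermediate answers are then aggregated into a final response. The chunking scheme, prompt wording, and the `call_llm` stub are illustrative assumptions, not the paper's actual structured information protocol.

```python
# Minimal sketch of a divide-and-conquer (map-reduce) pipeline in the spirit
# of the abstract. The chunking scheme, prompt wording, and the `call_llm`
# stub are illustrative assumptions, not the paper's actual protocol.

from dataclasses import dataclass
from typing import List


@dataclass
class IntermediateAnswer:
    """Structured map-stage output: an answer plus a self-reported confidence,
    used later to arbitrate inter-chunk conflicts."""
    answer: str
    confidence: float  # in [0, 1], elicited from the model in-context


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; swap in a real client."""
    raise NotImplementedError


def split_into_chunks(document: str, chunk_size: int = 4000) -> List[str]:
    # Naive fixed-size splitting; the paper's structured information protocol
    # would also carry cross-chunk context, which is omitted here.
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]


def map_stage(chunks: List[str], question: str) -> List[IntermediateAnswer]:
    # Each chunk is read independently and must report a confidence score,
    # a crude approximation of in-context confidence calibration.
    results = []
    for chunk in chunks:
        reply = call_llm(
            f"Read the passage and answer the question.\n"
            f"Passage:\n{chunk}\n\nQuestion: {question}\n"
            f"Reply exactly as: ANSWER: <text> | CONFIDENCE: <0-1>"
        )
        answer, _, conf = reply.partition("| CONFIDENCE:")
        results.append(IntermediateAnswer(
            answer=answer.replace("ANSWER:", "").strip(),
            confidence=float(conf.strip() or 0.0),
        ))
    return results


def reduce_stage(intermediate: List[IntermediateAnswer], question: str) -> str:
    # Aggregate chunk-level answers; conflicting answers are presented to the
    # model ordered by confidence before a final synthesis call.
    ranked = sorted(intermediate, key=lambda r: r.confidence, reverse=True)
    evidence = "\n".join(f"- ({r.confidence:.2f}) {r.answer}" for r in ranked)
    return call_llm(
        f"Combine these chunk-level answers (highest confidence first) into "
        f"one final answer.\nQuestion: {question}\nAnswers:\n{evidence}"
    )


def llm_map_reduce(document: str, question: str) -> str:
    chunks = split_into_chunks(document)
    return reduce_stage(map_stage(chunks, question), question)
```

Because every stage is an ordinary LLM call with no gradient updates, a pipeline of this shape stays training-free and can wrap several different backbone models, which is consistent with the applicability claim in the abstract.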