LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models

Abstract

Enlarging the context window of large language models (LLMs) has become a crucial research area, particularly for applications involving extremely long texts. In this work, we propose a novel training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding. The proposed LLM×MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output. The main challenge for divide-and-conquer long text processing frameworks lies in the risk of losing essential long-range information when splitting the document, which can lead the model to produce incomplete or incorrect answers based on the segmented texts. Disrupted long-range information can be classified into two categories: inter-chunk dependency and inter-chunk conflict. We design a structured information protocol to better cope with inter-chunk dependency and an in-context confidence calibration mechanism to resolve inter-chunk conflicts. Experimental results demonstrate that LLM×MapReduce can outperform representative open-source and commercial long-context LLMs, and is applicable to several different models.
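
The divide-and-conquer flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The `ChunkResult` fields, the character-based chunking, and the rule of keeping the most confident chunk answer are assumptions made for illustration. In the framework itself, both the per-chunk reading and the aggregation are performed by an LLM, which here is replaced by a caller-supplied `map_fn`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChunkResult:
    """Hypothetical structured record produced for one chunk (assumed fields)."""
    information: str   # facts from the chunk that other chunks may depend on
    answer: str        # chunk-level answer, or "" if the chunk has none
    confidence: float  # self-reported score in [0, 1]


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def reduce_results(results: List[ChunkResult]) -> str:
    """Resolve conflicting chunk answers by keeping the most confident one.

    This simple rule is a stand-in for the LLM-based aggregation step.
    """
    answered = [r for r in results if r.answer]
    if not answered:
        return ""
    return max(answered, key=lambda r: r.confidence).answer


def map_reduce(
    text: str,
    question: str,
    map_fn: Callable[[str, str], ChunkResult],
    chunk_size: int,
) -> str:
    """Map stage: query each chunk independently. Reduce stage: aggregate."""
    chunks = split_into_chunks(text, chunk_size)
    results = [map_fn(chunk, question) for chunk in chunks]
    return reduce_results(results)
```

In practice, `map_fn` would wrap a prompt to a language model and parse its reply into a `ChunkResult`. Comparable confidence scores across chunks are what make the conflict-resolution step meaningful, which is the role the abstract assigns to in-context confidence calibration.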
