Large Language Models Can Self-Improve in Long-context Reasoning

Abstract

Large language models (LLMs) have made substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically fine-tune LLMs on synthetic data that depends on annotations from human experts or advanced models like GPT-4, which restricts further advancement. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose \ours, an approach specifically designed for this purpose. The approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk, and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of \ours, with an absolute improvement of 4.2 points for Llama-3.1-8B-Instruct. Furthermore, \ours achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.
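The sample-then-score step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which similarity measure is used for Minimum Bayes Risk scoring, so `token_overlap` (a Jaccard overlap on whitespace tokens) is a stand-in assumption, and the helper names `mbr_scores` and `build_preference_pair` are hypothetical. Pairing the highest-scoring output with the lowest-scoring one for preference optimization is likewise an assumed design choice.

```python
from typing import Callable, List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens (assumed stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mbr_scores(
    outputs: List[str],
    sim: Callable[[str, str], float] = token_overlap,
) -> List[float]:
    """Score each sampled output by its mean similarity to the other samples.

    Under Minimum Bayes Risk, an output that agrees with most other samples
    has low expected risk, so a higher score means a more consistent output.
    """
    n = len(outputs)
    if n < 2:
        raise ValueError("MBR scoring needs at least two sampled outputs")
    return [
        sum(sim(outputs[i], outputs[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def build_preference_pair(outputs: List[str]) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-scoring outputs.

    The chosen output alone could serve as a supervised fine-tuning target;
    the pair could feed a preference-optimization objective (assumption).
    """
    scores = mbr_scores(outputs)
    ranked = sorted(range(len(outputs)), key=lambda i: scores[i], reverse=True)
    return outputs[ranked[0]], outputs[ranked[-1]]


# Four hypothetical samples for one question; three agree, one is an outlier.
samples = [
    "the answer is Paris",
    "the answer is Paris",
    "so the answer is Paris",
    "the answer is London",
]
chosen, rejected = build_preference_pair(samples)
```

In this toy run the outlier sample ends up as the rejected output because it agrees least with the rest. A real pipeline would draw the samples from the model being trained and would likely swap in a stronger similarity function.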
