Large Language Models Can Self-Improve in Long-context Reasoning

Abstract

Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically involve fine-tuning LLMs with synthetic data, which depends on annotations from human experts or advanced models like GPT-4, thus restricting further advancements. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose \ours, an approach specifically designed for this purpose. This approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk, and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of \ours, with an absolute improvement of 4.2 points for Llama-3.1-8B-Instruct. Furthermore, \ours achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.
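The sample-then-score step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which similarity measure drives the Minimum Bayes Risk score, so the token-overlap F1 used here is an assumption, and the helper names (`token_f1`, `mbr_scores`, `build_training_pair`) are hypothetical. Taking the top-scored output as the fine-tuning target and pairing it with the lowest-scored output for preference optimization is likewise one plausible reading of "based on these outputs".

```python
from collections import Counter


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (assumed stand-in similarity)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def mbr_scores(outputs: list[str]) -> list[float]:
    """Score each output by its mean similarity to the other sampled outputs.

    Under Minimum Bayes Risk, an output that agrees with most other samples
    has low expected risk, so it receives a high score.
    """
    n = len(outputs)
    if n < 2:
        return [0.0] * n
    return [
        sum(token_f1(outputs[i], outputs[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def build_training_pair(outputs: list[str]) -> tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-scored outputs."""
    scores = mbr_scores(outputs)
    ranked = sorted(range(len(outputs)), key=lambda i: scores[i], reverse=True)
    return outputs[ranked[0]], outputs[ranked[-1]]


# Hypothetical sampled outputs for a single question.
samples = [
    "the answer is Paris",
    "the answer is Paris France",
    "the answer is Paris",
    "it is probably Berlin",
]
chosen, rejected = build_training_pair(samples)
```

In this sketch, `chosen` alone would serve as a supervised fine-tuning target, while the `(chosen, rejected)` pair would feed a preference-optimization objective.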
