Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
October 23, 2023
Authors: Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, Xian Li
cs.AI
Abstract
Large Language Models (LLMs) are frequently used for multi-faceted language
generation and evaluation tasks that involve satisfying intricate user
constraints or taking into account multiple aspects and criteria. However,
their performance can fall short, due to the model's lack of coherence and
inability to plan and decompose the problem. We propose Branch-Solve-Merge
(BSM), a Large Language Model program (Schlag et al., 2023) for tackling such
challenging natural language tasks. It consists of branch, solve, and merge
modules that are parameterized with specific prompts to the base LLM. These
three modules plan a decomposition of the task into multiple parallel
sub-tasks, independently solve them, and fuse the solutions to the sub-tasks.
We apply our method to the tasks of LLM response evaluation and constrained
text generation and evaluate its effectiveness with multiple LLMs, including
Vicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and
consistency for each LLM by enhancing human-LLM agreement by up to 26%,
reducing length and pairwise position biases by up to 50%, and allowing
LLaMA-2-chat to match or outperform GPT-4 on most domains. On the constrained
story generation task, BSM improves the coherence of the stories while also
improving constraint satisfaction by 12%.
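The branch, solve, and merge modules described above can be viewed as an LLM program: each module is a prompted call to the same base model. A minimal sketch of that control flow, assuming a generic `llm(prompt) -> str` completion function (stubbed here for illustration; the function, prompt wording, and helper names are assumptions, not the paper's actual prompts):

```python
def llm(prompt: str) -> str:
    # Stub standing in for a real LLM call (e.g., to Vicuna or LLaMA-2-chat).
    if prompt.startswith("List"):
        return "relevance\ncoherence"
    return f"partial answer for: {prompt[:30]}"

def branch(task: str) -> list[str]:
    # Branch module: plan a decomposition of the task into parallel sub-tasks.
    plan = llm(f"List the evaluation criteria for: {task}")
    return [line.strip() for line in plan.splitlines() if line.strip()]

def solve(task: str, sub_task: str) -> str:
    # Solve module: address each sub-task independently.
    return llm(f"Assess '{task}' on the criterion '{sub_task}'.")

def merge(task: str, solutions: list[str]) -> str:
    # Merge module: fuse the sub-task solutions into a final answer.
    joined = "\n".join(solutions)
    return llm(f"Combine these partial assessments for '{task}':\n{joined}")

def branch_solve_merge(task: str) -> str:
    sub_tasks = branch(task)
    solutions = [solve(task, s) for s in sub_tasks]
    return merge(task, solutions)
```

Because the sub-tasks are independent, the solve calls can run in parallel; only the merge step depends on all of them.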