Advancing LLM Reasoning Generalists with Preference Trees

April 2, 2024
Authors: Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B outperforms GPT-3.5 Turbo in reasoning across a comprehensive evaluation of 12 benchmarks covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins of more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks than they are for general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.
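
The abstract does not spell out the UltraInteract schema, but the following minimal sketch illustrates how a preference tree over multi-turn interaction trajectories could yield pairwise data for preference learning. All class and function names here are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical node in a preference tree: one action (a reasoning step or
# code attempt) plus the environment/critique feedback it received.
@dataclass
class PreferenceNode:
    action: str                       # model response at this turn
    observation: str                  # environment feedback / critique
    correct: bool                     # whether this attempt succeeded
    children: List["PreferenceNode"] = field(default_factory=list)

def extract_pairs(node: PreferenceNode) -> List[Tuple[str, str]]:
    """Collect (chosen, rejected) action pairs from sibling nodes.

    At every branching point, each correct sibling is paired with each
    incorrect sibling, producing pairwise preference data; the recursion
    then descends into deeper turns of the trajectory.
    """
    pairs: List[Tuple[str, str]] = []
    correct = [c for c in node.children if c.correct]
    incorrect = [c for c in node.children if not c.correct]
    for good in correct:
        for bad in incorrect:
            pairs.append((good.action, bad.action))
    for child in node.children:
        pairs.extend(extract_pairs(child))
    return pairs
```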
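The abstract also states that a novel reward modeling objective is derived because margin-only preference objectives can be less suitable for reasoning, where the absolute correctness of a response matters. One plausible sketch, under the assumption that the objective augments the standard Bradley-Terry ranking loss with absolute reward terms (an assumption; the exact formulation is not given in the abstract), is:

```python
import torch
import torch.nn.functional as F

def reward_modeling_loss(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss augmented with absolute reward terms (a sketch).

    r_chosen / r_rejected: scalar rewards for the chosen and rejected
    responses in each pair, shape (batch,).
    """
    # Relative term: enforce the usual chosen-over-rejected margin.
    l_bt = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Absolute terms (assumed, not verbatim from the abstract): push
    # correct answers toward positive reward and incorrect ones toward
    # negative reward, rather than only widening the gap between them.
    l_abs = -(F.logsigmoid(r_chosen) + F.logsigmoid(-r_rejected)).mean()
    return l_bt + l_abs
```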
