Advancing LLM Reasoning Generalists with Preference Trees
April 2, 2024
Authors: Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
We introduce Eurus, a suite of large language models (LLMs) optimized for
reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve
state-of-the-art results among open-source models on a diverse set of
benchmarks covering mathematics, code generation, and logical reasoning
problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning on a
comprehensive benchmark of 12 tests covering five tasks, and achieves a
33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging
benchmarks, substantially outperforming existing open-source models by margins of
more than 13.3%. The strong performance of Eurus can be primarily attributed to
UltraInteract, our newly curated large-scale, high-quality alignment dataset
specifically designed for complex reasoning tasks. UltraInteract can be used in
both supervised fine-tuning and preference learning. For each instruction, it
includes a preference tree consisting of (1) reasoning chains with diverse
planning strategies in a unified format, (2) multi-turn interaction
trajectories with the environment and the critique, and (3) pairwise data to
facilitate preference learning. UltraInteract allows us to conduct an in-depth
exploration of preference learning for reasoning tasks. Our investigation
reveals that some well-established preference learning algorithms may be less
suitable for reasoning tasks compared to their effectiveness in general
conversations. Inspired by this, we derive a novel reward modeling objective
which, together with UltraInteract, leads to a strong reward model.
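
To make the preference-tree idea concrete, the following is a minimal sketch of how pairwise data might be read off such a tree. The abstract does not specify the UltraInteract schema, so the `Node` class, its fields, and the `collect_pairs` helper are all hypothetical names chosen for illustration; the pairing rule used here (every correct response against every incorrect sibling at the same turn) is likewise an assumption, not the authors' procedure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One model response in a (hypothetical) preference tree."""
    response: str
    correct: bool
    # Follow-up responses produced after feedback on this response.
    children: List["Node"] = field(default_factory=list)


def collect_pairs(instruction: str, roots: List[Node]) -> List[Tuple[str, str, str]]:
    """Return (context, chosen, rejected) triples built from sibling responses."""
    pairs = []
    stack = [(instruction, roots)]
    while stack:
        context, siblings = stack.pop()
        chosen = [n for n in siblings if n.correct]
        rejected = [n for n in siblings if not n.correct]
        # Assumed rule: pair each correct sibling with each incorrect one.
        for c in chosen:
            for r in rejected:
                pairs.append((context, c.response, r.response))
        # Descend into later turns, extending the context with the earlier response.
        for n in siblings:
            if n.children:
                stack.append((context + "\n" + n.response, n.children))
    return pairs


tree = [
    Node("attempt A (correct)", True),
    Node("attempt B (wrong)", False, children=[
        Node("revised B (correct)", True),
        Node("revised B (still wrong)", False),
    ]),
]
pairs = collect_pairs("Solve x + 1 = 3.", tree)
for context, chosen, rejected in pairs:
    print(repr(context), "|", chosen, ">", rejected)
```

In this toy tree the first turn yields one pair and the second turn, reached through the incorrect attempt, yields another, which mirrors the abstract's description of multi-turn trajectories combined with pairwise data.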