Advancing LLM Reasoning Generalists with Preference Trees

Abstract

We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Fine-tuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B outperforms GPT-3.5 Turbo in reasoning in a comprehensive evaluation across 12 tests covering five tasks. It achieves 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins of more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly curated large-scale, high-quality alignment dataset designed specifically for complex reasoning tasks. UltraInteract can be used for both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms, although effective in general conversation, may be less suitable for reasoning tasks. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.
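To make the three components of a preference tree concrete, here is a minimal Python sketch. It is an illustration only and not the UltraInteract schema: the `Action` class, its field names, and the `collect_pairs` helper are hypothetical, and the pairing rule (every correct action against every incorrect sibling that shares the same context) is an assumption chosen for simplicity.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Tuple


@dataclass
class Action:
    """One model response at a given turn (hypothetical schema)."""
    response: str                 # reasoning chain or code produced at this turn
    is_correct: bool              # verdict from the environment
    feedback: str = ""            # environment observation and/or critique
    children: List["Action"] = field(default_factory=list)  # follow-up turns


def collect_pairs(context: str, siblings: List[Action]) -> List[Tuple[str, str, str]]:
    """Return (context, chosen, rejected) triples for a tree of actions.

    Assumption: sibling actions share one context, and each correct action
    is paired with each incorrect sibling.
    """
    chosen = [a for a in siblings if a.is_correct]
    rejected = [a for a in siblings if not a.is_correct]
    pairs = [(context, c.response, r.response) for c, r in product(chosen, rejected)]

    # Descend into later turns: the context grows by the earlier action
    # and the feedback it received.
    for action in siblings:
        if action.children:
            next_context = f"{context}\n{action.response}\n{action.feedback}"
            pairs.extend(collect_pairs(next_context, action.children))
    return pairs


# Toy two-turn tree: a failed first attempt is retried after feedback.
retry_ok = Action("x = 4", True)
retry_bad = Action("x = 5", False)
first_bad = Action("x = 3", False, feedback="Check failed.",
                   children=[retry_ok, retry_bad])
first_ok = Action("x = 4 (direct)", True)

tree_pairs = collect_pairs("Solve 2x = 8.", [first_ok, first_bad])
for ctx, good, bad in tree_pairs:
    print(repr(ctx), "| chosen:", good, "| rejected:", bad)
```

In this toy tree the first turn yields one pair and the retry turn yields another, with the second pair's context carrying the earlier attempt and its feedback. Triples of this shape are the usual input to pairwise preference learning and reward modeling.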
