Advancing LLM Reasoning Generalists with Preference Trees
April 2, 2024
Authors: Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
We introduce Eurus, a suite of large language models (LLMs) optimized for
reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve
state-of-the-art results among open-source models on a diverse set of
benchmarks covering mathematics, code generation, and logical reasoning
problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning on a
comprehensive benchmark of 12 tests covering five tasks, and achieves 33.3%
pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging
benchmarks, substantially outperforming existing open-source models by margins
of more than 13.3%. The strong performance of Eurus can be primarily attributed to
UltraInteract, our newly-curated large-scale, high-quality alignment dataset
specifically designed for complex reasoning tasks. UltraInteract can be used in
both supervised fine-tuning and preference learning. For each instruction, it
includes a preference tree consisting of (1) reasoning chains with diverse
planning strategies in a unified format, (2) multi-turn interaction
trajectories with the environment and the critique, and (3) pairwise data to
facilitate preference learning. UltraInteract allows us to conduct an in-depth
exploration of preference learning for reasoning tasks. Our investigation
reveals that some well-established preference learning algorithms may be less
effective for reasoning tasks than they are in general conversations. Inspired
by this, we derive a novel reward modeling objective
which, together with UltraInteract, leads to a strong reward model.
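As a rough illustration of the data described in the abstract, below is a minimal Python sketch of (a) a hypothetical preference-tree schema with per-turn actions, correctness labels, and critiques, (b) a traversal that pairs a correct action with an incorrect sibling at the same turn to yield preference pairs, and (c) one plausible pairwise reward-modeling loss that adds absolute terms to a Bradley-Terry comparison. All names (PreferenceTreeNode, extract_pairs, reward_loss) and the exact loss form are assumptions for illustration, not the paper's released schema or its derived objective.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PreferenceTreeNode:
    """One node in an UltraInteract-style preference tree (hypothetical schema).

    Stores a model action (a reasoning step or code attempt), whether the
    environment/critique judged it correct, the critique text, and the
    follow-up turns branching from it."""
    action: str
    is_correct: bool
    critique: Optional[str] = None
    children: List["PreferenceTreeNode"] = field(default_factory=list)

@dataclass
class PreferenceTree:
    instruction: str
    root_actions: List[PreferenceTreeNode]

def extract_pairs(tree: PreferenceTree):
    """Walk the tree and emit (context, chosen, rejected) triples by pairing
    each correct action with an incorrect sibling at the same turn."""
    pairs = []

    def visit(siblings: List[PreferenceTreeNode], context: str) -> None:
        correct = [n for n in siblings if n.is_correct]
        wrong = [n for n in siblings if not n.is_correct]
        for c in correct:
            for w in wrong:
                pairs.append((context, c.action, w.action))
        for n in siblings:
            # Extend the context with this turn and its critique, then recurse.
            extra = "\n" + n.critique if n.critique else ""
            visit(n.children, context + "\n" + n.action + extra)

    visit(tree.root_actions, tree.instruction)
    return pairs

def reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-modeling loss sketch: a Bradley-Terry comparison term
    plus absolute terms pushing chosen rewards up and rejected rewards down.
    (Illustrative assumption; the paper defines its own objective.)"""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    l_bt = -math.log(sigmoid(r_chosen - r_rejected))                      # relative preference
    l_abs = -math.log(sigmoid(r_chosen)) - math.log(sigmoid(-r_rejected))  # absolute anchoring
    return l_bt + l_abs
```

The extra absolute terms are one way to drive correct solutions toward high reward and incorrect ones toward low reward rather than only ranking them; whether this matches the objective actually derived in the paper should be checked against the paper itself.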