Advancing LLM Reasoning Generalists with Preference Trees
April 2, 2024
Authors: Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
We introduce Eurus, a suite of large language models (LLMs) optimized for
reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve
state-of-the-art results among open-source models on a diverse set of
benchmarks covering mathematics, code generation, and logical reasoning
problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning on a
comprehensive benchmark of 12 tests covering five tasks, and achieves 33.3%
pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging
benchmarks, substantially outperforming existing open-source models by margins
of more than 13.3%. The strong performance of Eurus can be primarily attributed to
UltraInteract, our newly-curated large-scale, high-quality alignment dataset
specifically designed for complex reasoning tasks. UltraInteract can be used in
both supervised fine-tuning and preference learning. For each instruction, it
includes a preference tree consisting of (1) reasoning chains with diverse
planning strategies in a unified format, (2) multi-turn interaction
trajectories with the environment and the critique, and (3) pairwise data to
facilitate preference learning. UltraInteract allows us to conduct an in-depth
exploration of preference learning for reasoning tasks. Our investigation
reveals that some well-established preference learning algorithms may be less
effective for reasoning tasks than they are in general conversations. Inspired
by this, we derive a novel reward modeling objective
which, together with UltraInteract, leads to a strong reward model.
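As a rough illustration of the data described in the abstract, below is a minimal Python sketch of (a) a hypothetical preference-tree schema with per-turn actions, correctness labels, and critiques, (b) a traversal that pairs a correct action with an incorrect sibling at the same turn to yield preference pairs, and (c) one plausible pairwise reward-modeling loss that adds absolute terms to a Bradley-Terry comparison. All names (PreferenceTreeNode, extract_pairs, reward_loss) and the exact loss form are assumptions for illustration, not the paper's released schema or its derived objective.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PreferenceTreeNode:
    """One node in an UltraInteract-style preference tree (hypothetical schema).

    Stores a model action (a reasoning step or code attempt), whether the
    environment/critique judged it correct, the critique text, and the
    follow-up turns branching from it."""
    action: str
    is_correct: bool
    critique: Optional[str] = None
    children: List["PreferenceTreeNode"] = field(default_factory=list)

@dataclass
class PreferenceTree:
    instruction: str
    root_actions: List[PreferenceTreeNode]

def extract_pairs(tree: PreferenceTree):
    """Walk the tree and emit (context, chosen, rejected) triples by pairing
    each correct action with an incorrect sibling at the same turn."""
    pairs = []

    def visit(siblings: List[PreferenceTreeNode], context: str) -> None:
        correct = [n for n in siblings if n.is_correct]
        wrong = [n for n in siblings if not n.is_correct]
        for c in correct:
            for w in wrong:
                pairs.append((context, c.action, w.action))
        for n in siblings:
            # Extend the context with this turn and its critique, then recurse.
            extra = "\n" + n.critique if n.critique else ""
            visit(n.children, context + "\n" + n.action + extra)

    visit(tree.root_actions, tree.instruction)
    return pairs

def reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-modeling loss sketch: a Bradley-Terry comparison term
    plus absolute terms pushing chosen rewards up and rejected rewards down.
    (Illustrative assumption; the paper defines its own objective.)"""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    l_bt = -math.log(sigmoid(r_chosen - r_rejected))                      # relative preference
    l_abs = -math.log(sigmoid(r_chosen)) - math.log(sigmoid(-r_rejected))  # absolute anchoring
    return l_bt + l_abs
```

The extra absolute terms are one way to drive correct solutions toward high reward and incorrect ones toward low reward rather than only ranking them; whether this matches the objective actually derived in the paper should be checked against the paper itself.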