ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
February 10, 2025
Authors: Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang
cs.AI
Abstract
We show that hierarchical LLM reasoning via scaling thought templates can
effectively optimize the reasoning search space and surpass the mathematical
reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek-V3.
We train our ReasonFlux-32B model with only 8 GPUs and introduce three
innovations: (i) a structured and generic thought template library, containing
around 500 high-level thought templates capable of generalizing to similar or
relevant reasoning problems; (ii) performing hierarchical reinforcement
learning on a sequence of thought templates instead of long CoTs, optimizing a
base LLM to plan out an optimal template trajectory for gradually handling
complex problems; (iii) a new inference scaling system that enables
hierarchical LLM reasoning by adaptively scaling thought templates at inference
time. With a template trajectory containing sequential thought templates, our
ReasonFlux-32B significantly advances math reasoning capabilities to
state-of-the-art levels. Notably, on the MATH benchmark, it achieves an
accuracy of 91.2% and surpasses o1-preview by 6.7%. On the American
Invitational Mathematics Examination (AIME) benchmark, ReasonFlux-32B solves
an average of 56.7% of problems,
surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code:
https://github.com/Gen-Verse/ReasonFlux
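
To make the three innovations concrete, below is a minimal Python sketch of how a thought-template library, a trajectory planner, and an adaptive inference loop could fit together. Everything in it (ThoughtTemplate, plan_trajectory, solve, and the tag-matching scheme) is a hypothetical illustration, not the paper's implementation; in particular, ReasonFlux trains its planner with hierarchical reinforcement learning, whereas this sketch substitutes a simple greedy tag-overlap heuristic.

```python
# Illustrative sketch of the pipeline the abstract describes:
# (i) a library of high-level thought templates, (ii) a planner that selects
# a template trajectory for a given problem, and (iii) inference-time
# instantiation of each template in sequence with early stopping.
# All names here are hypothetical stand-ins, not ReasonFlux's actual API.

from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ThoughtTemplate:
    """One reusable high-level reasoning step (e.g. 'complete the square')."""
    name: str
    strategy: str     # abstract guidance, not a worked solution
    tags: frozenset   # problem features the template generalizes over

def plan_trajectory(problem_tags: set, library: list, max_steps: int = 4) -> list:
    """Greedy stand-in for the RL-trained hierarchical planner: rank templates
    by tag overlap with the problem and keep the top few that match at all."""
    ranked = sorted(library, key=lambda t: len(t.tags & problem_tags), reverse=True)
    return [t for t in ranked[:max_steps] if t.tags & problem_tags]

def solve(problem: str, problem_tags: set, library: list,
          llm: Callable[[str], str]) -> str:
    """Instantiate each planned template against the evolving partial solution,
    adaptively scaling: stop as soon as the model emits a final answer."""
    state = problem
    for template in plan_trajectory(problem_tags, library):
        state = llm(
            f"Apply the strategy '{template.name}': {template.strategy}\n"
            f"Progress so far:\n{state}"
        )
        if "ANSWER:" in state:  # adaptive inference scaling: no wasted steps
            break
    return state
```

The design choice the abstract emphasizes is that reinforcement learning operates over sequences of these high-level templates rather than over long token-level chains of thought, which shrinks the search space the planner must optimize.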