
BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation

February 6, 2025
Authors: Bo Pang, Hanze Dong, Jiacheng Xu, Silvio Savarese, Yingbo Zhou, Caiming Xiong
cs.AI

Abstract

Large language models (LLMs), such as o1 from OpenAI, have demonstrated remarkable reasoning capabilities. o1 generates a long chain-of-thought (LongCoT) before answering a question. LongCoT allows LLMs to analyze problems, devise plans, reflect, and backtrack effectively, which empowers them to solve complex problems. After the release of o1, many teams have attempted to replicate its LongCoT and reasoning capabilities. In terms of methods, they rely primarily on knowledge distillation with data from existing models that already have LongCoT capacities (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), leaving significant uncertainty about how to develop such reasoning abilities systematically. In terms of data domains, these works focus narrowly on math, with a few also covering coding, which limits their generalizability. This paper introduces a novel approach that enables LLMs' LongCoT capacity without distillation from o1-like models or expensive human annotations: we bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three stages: 1) LongCoT data bootstrapping with in-context learning on a standard instruct model; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capacities. In BOLT, only a few in-context examples need to be constructed during the bootstrapping stage; in our experiments, we created 10 examples, demonstrating the feasibility of this approach. We use Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to various model scales (7B, 8B, 70B). We achieve impressive performance on a variety of benchmarks, including Arena-Hard, MT-Bench, WildBench, ZebraLogic, and MATH500, which evaluate diverse task-solving and reasoning capabilities.
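The bootstrapping stage described above can be sketched in outline. This is a minimal illustration only: the function names, prompt format, and stub generator below are assumptions for exposition, not the authors' actual implementation; stages 2 and 3 (supervised finetuning and online training) then proceed with standard training machinery on the collected data.

```python
def bootstrap_longcot_data(generate, in_context_examples, queries):
    """Stage 1 of BOLT (illustrative sketch): prepend a handful of
    handcrafted LongCoT demonstrations to each query so that a standard
    instruct model emits a long reasoning trace via in-context learning,
    then collect (query, trace) pairs for later supervised finetuning.

    `generate` is any callable mapping a prompt string to a completion,
    e.g. a wrapper around an instruct model such as Llama-3.1-70B-Instruct.
    """
    few_shot_prefix = "\n\n".join(in_context_examples)
    dataset = []
    for query in queries:
        # The demonstrations steer the model toward producing a LongCoT
        # (analysis, planning, reflection) before the final answer.
        prompt = f"{few_shot_prefix}\n\nQuestion: {query}\nLongCoT:"
        dataset.append({"query": query, "longcot": generate(prompt)})
    return dataset


# Toy usage with a stub generator standing in for a real instruct model.
demo = ("Question: What is 2 + 2?\n"
        "LongCoT: Let me break this down. 2 + 2 combines two pairs, "
        "so the total is 4. The answer is 4.")
fake_generate = lambda prompt: "bootstrapped reasoning trace"
data = bootstrap_longcot_data(fake_generate, [demo], ["What is 3 + 5?"])
```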

