BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation

Abstract

Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable reasoning capabilities. o1 generates a long chain-of-thought (LongCoT) before answering a question. LongCoT allows LLMs to analyze problems, devise plans, reflect, and backtrack effectively, and these behaviors enable them to solve complex problems. Since the release of o1, many teams have attempted to replicate its LongCoT and reasoning capabilities. Methodologically, they rely primarily on knowledge distillation using data from existing models with LongCoT capacity (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), which leaves significant uncertainty about how to develop such reasoning abilities systematically. In terms of data domains, these works focus narrowly on math, with a few also covering coding, which limits their generalizability. This paper introduces a novel approach that enables LongCoT capacity in LLMs without distillation from o1-like models or expensive human annotation: we bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three stages: 1) LongCoT data bootstrapping with in-context learning on a standard instruct model; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capacity. In BOLT, only a few in-context examples need to be constructed during the bootstrapping stage; in our experiments, we created 10 examples, demonstrating the feasibility of this approach. We use Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to models at several scales (7B, 8B, 70B). We achieve impressive performance on a variety of benchmarks (Arena-Hard, MT-Bench, WildBench, ZebraLogic, and MATH500) that evaluate diverse task-solving and reasoning capabilities.
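
The first stage, bootstrapping LongCoT data with in-context learning, can be pictured as few-shot prompting of a standard instruct model. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify a prompt format, so the `Demo` class, the `<thought>` delimiters, the `Query:` and `Answer:` labels, and both helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical delimiters: the source does not specify any prompt format.
THOUGHT_OPEN, THOUGHT_CLOSE = "<thought>", "</thought>"


@dataclass
class Demo:
    """One hand-written in-context example (the abstract mentions 10 of these)."""
    query: str
    thought: str  # long chain-of-thought: analysis, planning, reflection, backtracking
    answer: str


def format_demo(demo: Demo) -> str:
    """Render a demonstration as query, delimited reasoning, then final answer."""
    return (
        f"Query: {demo.query}\n"
        f"{THOUGHT_OPEN}\n{demo.thought}\n{THOUGHT_CLOSE}\n"
        f"Answer: {demo.answer}\n"
    )


def build_bootstrap_prompt(demos: List[Demo], query: str) -> str:
    """Concatenate the demonstrations and end with an open thought tag,
    so that the model continues with reasoning for the new query."""
    shots = "\n".join(format_demo(d) for d in demos)
    return f"{shots}\nQuery: {query}\n{THOUGHT_OPEN}\n"


def split_completion(completion: str) -> Tuple[str, str]:
    """Split a model completion into (thought, answer).
    Raises ValueError if the completion does not follow the demonstrated format."""
    thought, sep, rest = completion.partition(THOUGHT_CLOSE)
    if not sep or "Answer:" not in rest:
        raise ValueError("completion does not follow the demonstrated format")
    return thought.strip(), rest.split("Answer:", 1)[1].strip()
```

In this sketch, the string returned by `build_bootstrap_prompt` would be sent to the instruct model, and `split_completion` would turn each well-formed completion into a (thought, answer) pair that could serve as a supervised finetuning example in the second stage. Completions that break the format raise an error and can be discarded.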

