PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning

Abstract

Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and test-time computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or too narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop in which rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than those in prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with gains of +4.4 on AIME 24, +4.8 on AIME 25, and +5.3 on HMMT 25; +6.1 on LiveCodeBench v5 and +5.0 on LiveCodeBench v6; and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts raises accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at https://github.com/inclusionAI/PromptCoT.
