PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning
September 24, 2025
Authors: Xueliang Zhao, Wei Wu, Jian Guan, Zhuocheng Gong, Lingpeng Kong
cs.AI
Abstract
Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and test-time computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop, where rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with +4.4, +4.8, and +5.3 on AIME 24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at https://github.com/inclusionAI/PromptCoT.
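
The abstract describes an EM loop in which rationales are iteratively refined and then used to guide prompt (problem) construction. The sketch below illustrates that idea only; the `llm` callable, the prompt templates, and the difficulty-scoring heuristic are hypothetical placeholders for illustration, not the authors' implementation (see the repository linked above for the actual code).

```python
# Minimal sketch of an EM-style rationale-refinement loop for prompt synthesis.
# The `llm` callable, prompt templates, and difficulty score are hypothetical
# placeholders -- they are NOT the authors' actual API or method details.
import random
import re
from typing import Callable, List, Tuple


def synthesize_problems(
    llm: Callable[[str], str],      # text-in, text-out model interface (assumption)
    concept_pool: List[str],
    n_rounds: int = 3,              # number of EM iterations
    n_candidates: int = 4,          # candidate rationales per E-step
) -> List[Tuple[str, str, str]]:
    """Return (concepts, rationale, problem) triples collected across rounds."""
    concepts = ", ".join(random.sample(concept_pool, k=min(3, len(concept_pool))))
    rationale = llm(f"Write a rationale for composing a hard problem combining: {concepts}")
    corpus: List[Tuple[str, str, str]] = []

    for _ in range(n_rounds):
        # E-step (sketch): propose refined rationales and keep the one whose
        # induced problem looks hardest under a simple judge score.
        best = (-1.0, rationale, "")
        for _ in range(n_candidates):
            cand = llm(
                "Refine this rationale so the resulting problem is harder "
                f"but still well-posed:\n{rationale}"
            )
            problem = llm(f"Using the rationale below, write one competition problem.\n{cand}")
            reply = llm(
                "Rate the difficulty of this problem from 0 to 10. "
                f"Reply with a single number.\n{problem}"
            )
            match = re.search(r"\d+(\.\d+)?", reply)
            score = float(match.group()) if match else 0.0
            if score > best[0]:
                best = (score, cand, problem)
        _, rationale, problem = best

        # M-step (sketch): in the full framework the prompt-generation model is
        # updated on (concepts, rationale) -> problem pairs; here we only
        # accumulate the pairs such an update would consume.
        corpus.append((concepts, rationale, problem))

    return corpus
```

Any text-completion function can be passed as `llm` for a toy run, e.g. a thin wrapper around an OpenAI-compatible chat endpoint.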
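
The abstract also mentions a self-play regime in which a strong model improves from its own attempts using verifiable feedback rather than a stronger teacher. Below is a rough sketch of such a verification-filtered data-collection step; `llm` and `verifier` are again assumed interfaces, where the verifier might check a final numeric answer for math or run unit tests for code.

```python
# Sketch of a self-play data-collection step driven by verifiable feedback.
# `llm` and `verifier` are hypothetical interfaces, not the authors' API.
from typing import Callable, List, Tuple


def collect_verified_traces(
    llm: Callable[[str], str],
    verifier: Callable[[str, str], bool],   # e.g. answer check or unit tests
    problems: List[str],
    attempts_per_problem: int = 8,
) -> List[Tuple[str, str]]:
    """Keep (problem, solution) pairs whose solutions pass the verifier."""
    accepted: List[Tuple[str, str]] = []
    for prob in problems:
        for _ in range(attempts_per_problem):
            solution = llm(f"Solve the following problem step by step:\n{prob}")
            if verifier(prob, solution):
                # Only verified traces feed the next training round.
                accepted.append((prob, solution))
                break
    return accepted
```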