PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning
September 24, 2025
Authors: Xueliang Zhao, Wei Wu, Jian Guan, Zhuocheng Gong, Lingpeng Kong
cs.AI
Abstract
Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and test-time computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop, where rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with +4.4, +4.8, and +5.3 on AIME 24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at https://github.com/inclusionAI/PromptCoT.
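
The abstract describes an EM loop in which rationales are iteratively refined and then used to guide prompt (problem) construction. The sketch below illustrates that idea only; the `llm` callable, the prompt templates, and the difficulty-scoring heuristic are hypothetical placeholders for illustration, not the authors' implementation (see the repository linked above for the actual code).

```python
# Minimal sketch of an EM-style rationale-refinement loop for prompt synthesis.
# The `llm` callable, prompt templates, and difficulty score are hypothetical
# placeholders -- they are NOT the authors' actual API or method details.
import random
import re
from typing import Callable, List, Tuple


def synthesize_problems(
    llm: Callable[[str], str],      # text-in, text-out model interface (assumption)
    concept_pool: List[str],
    n_rounds: int = 3,              # number of EM iterations
    n_candidates: int = 4,          # candidate rationales per E-step
) -> List[Tuple[str, str, str]]:
    """Return (concepts, rationale, problem) triples collected across rounds."""
    concepts = ", ".join(random.sample(concept_pool, k=min(3, len(concept_pool))))
    rationale = llm(f"Write a rationale for composing a hard problem combining: {concepts}")
    corpus: List[Tuple[str, str, str]] = []

    for _ in range(n_rounds):
        # E-step (sketch): propose refined rationales and keep the one whose
        # induced problem looks hardest under a simple judge score.
        best = (-1.0, rationale, "")
        for _ in range(n_candidates):
            cand = llm(
                "Refine this rationale so the resulting problem is harder "
                f"but still well-posed:\n{rationale}"
            )
            problem = llm(f"Using the rationale below, write one competition problem.\n{cand}")
            reply = llm(
                "Rate the difficulty of this problem from 0 to 10. "
                f"Reply with a single number.\n{problem}"
            )
            match = re.search(r"\d+(\.\d+)?", reply)
            score = float(match.group()) if match else 0.0
            if score > best[0]:
                best = (score, cand, problem)
        _, rationale, problem = best

        # M-step (sketch): in the full framework the prompt-generation model is
        # updated on (concepts, rationale) -> problem pairs; here we only
        # accumulate the pairs such an update would consume.
        corpus.append((concepts, rationale, problem))

    return corpus
```

Any text-completion function can be passed as `llm` for a toy run, e.g. a thin wrapper around an OpenAI-compatible chat endpoint.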
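
The abstract also mentions a self-play regime in which a strong model improves from its own attempts using verifiable feedback rather than a stronger teacher. Below is a rough sketch of such a verification-filtered data-collection step; `llm` and `verifier` are again assumed interfaces, where the verifier might check a final numeric answer for math or run unit tests for code.

```python
# Sketch of a self-play data-collection step driven by verifiable feedback.
# `llm` and `verifier` are hypothetical interfaces, not the authors' API.
from typing import Callable, List, Tuple


def collect_verified_traces(
    llm: Callable[[str], str],
    verifier: Callable[[str, str], bool],   # e.g. answer check or unit tests
    problems: List[str],
    attempts_per_problem: int = 8,
) -> List[Tuple[str, str]]:
    """Keep (problem, solution) pairs whose solutions pass the verifier."""
    accepted: List[Tuple[str, str]] = []
    for prob in problems:
        for _ in range(attempts_per_problem):
            solution = llm(f"Solve the following problem step by step:\n{prob}")
            if verifier(prob, solution):
                # Only verified traces feed the next training round.
                accepted.append((prob, solution))
                break
    return accepted
```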