PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning
September 24, 2025
Authors: Xueliang Zhao, Wei Wu, Jian Guan, Zhuocheng Gong, Lingpeng Kong
cs.AI
Abstract
Large language models (LLMs) are evolving from conversational systems into
strong reasoners for tasks such as Olympiad mathematics and competitive
programming. While scaling parameters and test-time computation has driven
progress, a key bottleneck is the lack of high-quality training problems:
human-curated datasets are costly and limited, while existing synthetic corpora
are often too easy or narrow. PromptCoT 1.0 showed that injecting rationales
into prompt synthesis increases problem difficulty. Building on this, we
present PromptCoT 2.0, a scalable framework that replaces hand-crafted
heuristics with an expectation-maximization (EM) loop, where rationales are
iteratively refined to guide prompt construction. This produces problems that
are both harder and more diverse than prior corpora. The synthetic prompts
support two post-training regimes: (1) Self-Play, where strong models improve
autonomously via verifiable feedback without stronger teachers; and (2)
Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled
traces. Extensive experiments demonstrate the effectiveness of this approach.
In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new
state-of-the-art results at the 30B scale, with +4.4, +4.8, and +5.3 on AIME
24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on
Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts
boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5),
surpassing models trained on human or hybrid data. Analyses further confirm
that PromptCoT 2.0 yields fundamentally harder and distributionally distinct
problems. These results establish prompt synthesis as a new axis for scaling
reasoning and position PromptCoT 2.0 as a scalable foundation for future
open-source models. The implementation is available at
https://github.com/inclusionAI/PromptCoT.
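The abstract describes an expectation-maximization (EM) loop in which rationales are iteratively refined to guide prompt construction. The toy sketch below illustrates that alternating structure only; every function in it is a hypothetical stand-in (the real system would use an LLM for generation and likelihood-based scoring, not these string operations), and none of the names come from the PromptCoT codebase.

```python
import random

def generate_rationale(concepts, rng):
    # E-step proposal (stand-in): draft a candidate rationale that
    # links the sampled concepts, here by joining them in random order.
    order = rng.sample(concepts, len(concepts))
    return "combine " + " with ".join(order)

def generate_problem(rationale):
    # Rationale-guided generation (stand-in): turn a rationale into a
    # problem statement.
    return f"Problem derived from: {rationale}"

def score(rationale, problem):
    # Stand-in for the model likelihood used to weight a rationale by
    # how well it explains the generated problem.
    return len(rationale)

def em_synthesis(concepts, iterations=3, candidates=4, seed=0):
    rng = random.Random(seed)
    best_rationale, best_score = None, float("-inf")
    for _ in range(iterations):
        for _ in range(candidates):
            # E-step: propose rationales and weight each one.
            r = generate_rationale(concepts, rng)
            s = score(r, generate_problem(r))
            # M-step (greedy stand-in): keep the highest-weight
            # rationale to guide the next round of construction.
            if s > best_score:
                best_rationale, best_score = r, s
    return generate_problem(best_rationale)

print(em_synthesis(["modular arithmetic", "pigeonhole principle"]))
```

In the paper's setting, the E-step would re-estimate a posterior over rationales and the M-step would update the generator on the re-weighted samples; the greedy argmax above is only the simplest illustration of that alternation.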