PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning

Abstract

Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. Scaling parameters and test-time computation has driven progress, but a key bottleneck remains the shortage of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or too narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop in which rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than those in prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with gains of +4.4, +4.8, and +5.3 on AIME 24, AIME 25, and HMMT 25, respectively, +6.1 and +5.0 on LiveCodeBench v5 and v6, and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts raises accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at https://github.com/inclusionAI/PromptCoT.
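
The EM loop mentioned in the abstract can be read through the textbook latent-variable view. As a sketch only, assume the rationale is a latent variable z behind an observed prompt x, with a prompt-generation model p_θ and a rationale proposal q_φ. These symbols are illustrative assumptions rather than notation taken from the paper, and the paper's actual objective may differ.

```latex
% Generic evidence lower bound for a latent rationale z (assumed notation)
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(z, x) - \log q_\phi(z \mid x) \right]

% E-step: tighten the bound by moving q_\phi toward the posterior over rationales
\phi \leftarrow \arg\min_{\phi}\;
  \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p_\theta(z \mid x) \right)

% M-step: refit the generator on rationales drawn from the current q_\phi
\theta \leftarrow \arg\max_{\theta}\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(z, x) \right]
```

Under this reading, alternating the two steps is what lets rationales be refined iteratively instead of being fixed by hand-crafted heuristics.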
