Socratic-Zero: Bootstrapping Reasoning via Data-Free Agent Co-evolution

Abstract

Recent breakthroughs in large language models (LLMs) on reasoning tasks rely heavily on massive, high-quality datasets, which are typically human-annotated and thus difficult to scale. While data synthesis or distillation offers a promising alternative, existing methods struggle with inconsistent data quality and cannot dynamically adapt to the evolving capabilities of the model, leading to suboptimal training signals. To address these limitations, we introduce Socratic-Zero, a fully autonomous framework that generates high-quality training data from minimal seed examples through the co-evolution of three agents: the Teacher, the Solver, and the Generator. The Solver continuously refines its reasoning by learning from preference feedback on both successful and failed trajectories; the Teacher adaptively crafts increasingly challenging questions based on the Solver's weaknesses; and the Generator distills the Teacher's question-design strategy to enable scalable, high-fidelity curriculum generation. This closed-loop system produces a self-improving curriculum that requires no pre-existing tasks or labels. Remarkably, starting from only 100 seed questions, our Socratic-Solver-8B achieves an average gain of +20.2 percentage points over prior data synthesis methods across seven mathematical reasoning benchmarks (AMC23, AIME24, AIME25, Olympiad, MATH-500, Minerva, and GSM8K), with consistent gains on both Qwen3 and GLM4 series models. Even more surprisingly, synthetic data from Socratic-Generator-32B enables student LLMs to outperform other state-of-the-art (SOTA) commercial LLMs on these benchmarks, including Qwen3-235B-A22B, DeepSeek-V3.1-671B, GPT-5, Gemini-2.5-Pro, Grok-4, and Claude-4.1-Opus.
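The closed loop described in the abstract can be pictured with a toy simulation. The sketch below is only an illustration under strong assumptions, not the authors' implementation: the scalar `difficulty` and `skill` values, the logistic success probability, the rule that any failed rollout triggers a harder variant, and the names `solve`, `teacher_refine`, and `co_evolve` are all hypothetical stand-ins for LLM-based agents and preference optimization.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    text: str
    difficulty: int  # hypothetical scalar stand-in for problem hardness


def solve(skill: float, q: Question, rng: random.Random) -> bool:
    """Toy Solver rollout: success is likelier when skill exceeds difficulty."""
    p_success = 1.0 / (1.0 + 2.0 ** (q.difficulty - skill))
    return rng.random() < p_success


def teacher_refine(q: Question) -> Question:
    """Toy Teacher: derive a harder variant of a question the Solver failed."""
    return Question(text=f"harder({q.text})", difficulty=q.difficulty + 1)


def co_evolve(seeds, rounds=3, rollouts=4, seed=0):
    """Run a few rounds of the (assumed) Solver/Teacher/Generator loop."""
    rng = random.Random(seed)
    skill = 1.0                 # assumed starting ability of the Solver
    curriculum = list(seeds)
    generator_data = []         # (source, refined) pairs a Generator could learn from
    for _ in range(rounds):
        preference_pairs = []
        new_questions = []
        for q in curriculum:
            outcomes = [solve(skill, q, rng) for _ in range(rollouts)]
            wins, losses = outcomes.count(True), outcomes.count(False)
            # A preference pair needs one successful and one failed trajectory.
            preference_pairs.extend([(q, "success", "failure")] * min(wins, losses))
            if losses > 0:
                # A weakness was exposed: the Teacher writes a harder variant.
                refined = teacher_refine(q)
                new_questions.append(refined)
                generator_data.append((q, refined))
        # Stand-in for preference optimization: skill grows with the feedback.
        skill += 0.1 * len(preference_pairs)
        curriculum.extend(new_questions)
    return skill, curriculum, generator_data


if __name__ == "__main__":
    seeds = [Question("q1", 1), Question("q2", 2)]
    skill, curriculum, pairs = co_evolve(seeds)
    print(f"skill={skill:.1f}, questions={len(curriculum)}, generator pairs={len(pairs)}")
```

In this toy version the Generator is reduced to the list of (source, refined) pairs it would be trained on. In the framework as described, that distilled strategy is what lets curriculum generation scale without calling the Teacher for every question.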
