Socratic-Zero: Bootstrapping Reasoning via Data-Free Agent Co-evolution
September 29, 2025
Authors: Shaobo Wang, Zhengbo Jiao, Zifan Zhang, Yilang Peng, Xu Ze, Boyu Yang, Wei Wang, Hu Wei, Linfeng Zhang
cs.AI
Abstract
Recent breakthroughs in large language models (LLMs) on reasoning tasks rely
heavily on massive, high-quality datasets, typically human-annotated and thus
difficult to scale. While data synthesis or distillation offers a promising
alternative, existing methods struggle with inconsistent data quality and an
inability to dynamically adapt to the evolving capabilities of the model,
leading to suboptimal training signals. To address these limitations, we
introduce Socratic-Zero, a fully autonomous framework that generates
high-quality training data from minimal seed examples through the co-evolution
of three agents: the Teacher, the Solver, and the Generator. The Solver
continuously refines its reasoning by learning from preference feedback on both
successful and failed trajectories; the Teacher adaptively crafts increasingly
challenging questions based on the Solver's weaknesses; and the Generator
distills the Teacher's question-design strategy to enable scalable,
high-fidelity curriculum generation. This closed-loop system produces a
self-improving curriculum, requiring no pre-existing tasks or labels.
Remarkably, starting from only 100 seed questions, our Socratic-Solver-8B
achieves an average gain of +20.2 percentage points over prior data synthesis
methods across seven mathematical reasoning benchmarks (AMC23, AIME24-25,
Olympiad, MATH-500, Minerva, and GSM8K), with consistent gains on both Qwen3
and GLM4 series models. Even more surprisingly, synthetic data from
Socratic-Generator-32B enables student LLMs to achieve superior performance
compared to other state-of-the-art (SOTA) commercial LLMs on these benchmarks,
including Qwen3-235B-A22B, DeepSeek-V3.1-671B, GPT-5, Gemini-2.5-Pro, Grok-4,
and Claude-4.1-Opus.
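The Teacher–Solver loop described in the abstract can be illustrated with a toy simulation. Everything below is a hedged sketch, not the paper's actual method: `ToySolver`, `ToyTeacher`, the numeric `skill`/`difficulty` knobs, and the sampling counts are all invented for illustration, and the Generator's distillation step is omitted. The sketch only shows the control flow: the Solver samples several trajectories per question, successful/failed pairs form preference feedback, and questions the Solver consistently fails seed harder variants from the Teacher.

```python
import random

class Trajectory:
    """Minimal stand-in for a reasoning trajectory: only records correctness."""
    def __init__(self, correct):
        self.correct = correct

class ToySolver:
    """Toy solver whose scalar 'skill' improves with preference feedback."""
    def __init__(self, skill=0.3):
        self.skill = skill

    def attempt(self, question):
        # Harder questions (larger 'difficulty') are solved less often.
        p = max(0.0, min(1.0, self.skill - 0.1 * question["difficulty"]))
        return Trajectory(random.random() < p)

    def update(self, preference_pairs):
        # Each (question, chosen, rejected) pair nudges the solver upward,
        # loosely mimicking a preference-based optimization signal.
        self.skill = min(1.0, self.skill + 0.05 * len(preference_pairs))

class ToyTeacher:
    """Crafts a harder variant of a question the solver failed."""
    def harder_variant(self, question):
        return {"difficulty": question["difficulty"] + 1}

def co_evolution_round(questions, solver, teacher):
    """One loop iteration: collect preference pairs, update the solver,
    and return harder questions targeting the solver's failures."""
    pairs, failed = [], []
    for q in questions:
        trajs = [solver.attempt(q) for _ in range(4)]
        wins = [t for t in trajs if t.correct]
        losses = [t for t in trajs if not t.correct]
        if wins and losses:
            pairs.append((q, wins[0], losses[0]))
        elif losses:
            failed.append(q)
    solver.update(pairs)
    return [teacher.harder_variant(q) for q in failed]

random.seed(0)
seed_questions = [{"difficulty": 0} for _ in range(10)]  # stands in for the 100 seeds
solver, teacher = ToySolver(), ToyTeacher()
curriculum = seed_questions
for _ in range(3):
    curriculum = curriculum + co_evolution_round(curriculum, solver, teacher)
```

The design point the sketch captures is the closed loop: question difficulty and solver ability rise together, so the curriculum stays near the solver's frontier instead of being fixed in advance.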