AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning

Abstract

We present AlphaApollo, a self-evolving agentic reasoning system that aims to address two bottlenecks in foundation model (FM) reasoning: limited model-intrinsic capacity and unreliable test-time iteration. AlphaApollo orchestrates multiple models with professional tools to enable deliberate, verifiable reasoning. It couples (i) a computation tool (Python with numerical and symbolic libraries) and (ii) a retrieval tool (task-relevant external information) to execute exact calculations and ground decisions. The system further supports multi-round, multi-model solution evolution via a shared state map that records candidates, executable checks, and feedback for iterative refinement. In evaluations on AIME 2024/2025 across multiple models, AlphaApollo delivers consistent gains: +5.15% Average@32 and +23.34% Pass@32 for Qwen2.5-14B-Instruct, and +8.91% Average@32 and +26.67% Pass@32 for Llama-3.3-70B-Instruct. Tool-use analysis shows that more than 80% of tool calls execute successfully and that the tool-augmented system consistently outperforms non-tool baselines, thereby lifting the capability ceiling of FMs. More empirical results and implementation details will be updated at https://github.com/tmlr-group/AlphaApollo.
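The abstract reports results as Average@32 and Pass@32 without defining them. The sketch below shows one common reading of such metrics, stated here as an assumption and not as the authors' evaluation code: Average@k is the mean accuracy over k sampled answers per problem, and Pass@k is the fraction of problems for which at least one of the k samples is correct. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def average_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Assumed Average@k: per-problem mean accuracy over k samples,
    then averaged across problems."""
    per_problem = [sum(row) / len(row) for row in correct]
    return sum(per_problem) / len(per_problem)


def pass_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Assumed Pass@k: fraction of problems with at least one
    correct sample among the k attempts."""
    return sum(any(row) for row in correct) / len(correct)


# Toy data (hypothetical): 3 problems, k = 4 samples each.
results = [
    [True, False, False, False],   # solved once in four attempts
    [False, False, False, False],  # never solved
    [True, True, True, True],      # always solved
]

avg = average_at_k(results)
pas = pass_at_k(results)
print(f"Average@4 = {avg:.4f}, Pass@4 = {pas:.4f}")
```

Under this reading, Pass@k is always at least as large as Average@k, which would be consistent with the Pass@32 gains quoted above being larger than the Average@32 gains.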
