AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning

Abstract

We present AlphaApollo, a self-evolving agentic reasoning system that aims to address two bottlenecks in foundation model (FM) reasoning: limited model-intrinsic capacity and unreliable test-time iteration. AlphaApollo orchestrates multiple models with professional tools to enable deliberate, verifiable reasoning. It couples (i) a computation tool (Python with numerical and symbolic libraries) and (ii) a retrieval tool (task-relevant external information) to execute exact calculations and ground decisions. The system further supports multi-round, multi-model solution evolution via a shared state map that records candidates, executable checks, and feedback for iterative refinement. In evaluations on AIME 2024/2025 across multiple models, AlphaApollo delivers consistent gains: +5.15% Average@32 and +23.34% Pass@32 for Qwen2.5-14B-Instruct, and +8.91% Average@32 and +26.67% Pass@32 for Llama-3.3-70B-Instruct. Tool-use analysis shows that more than 80% of tool calls execute successfully and that tool-augmented runs consistently outperform non-tool baselines, thereby lifting the capability ceiling of FMs. More empirical results and implementation details will be posted at https://github.com/tmlr-group/AlphaApollo.
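The abstract reports Average@32 and Pass@32 without defining them. The sketch below assumes the common reading: Average@k is the mean accuracy over k sampled solutions per problem, and Pass@k is the fraction of problems for which at least one of the k samples is correct. The function names and the toy data are hypothetical illustrations, not taken from the AlphaApollo codebase.

```python
from typing import Sequence


def average_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Assumed Average@k: per-problem fraction of correct samples, averaged over problems."""
    return sum(sum(r) / len(r) for r in results) / len(results)


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Assumed Pass@k: fraction of problems with at least one correct sample."""
    return sum(any(r) for r in results) / len(results)


# Hypothetical toy data: 3 problems with 4 samples each (the paper samples k = 32).
toy = [
    [True, False, False, True],
    [False, False, False, False],
    [False, True, False, False],
]

print(f"Average@4 = {average_at_k(toy):.4f}")
print(f"Pass@4    = {pass_at_k(toy):.4f}")
```

Under these assumed definitions, one extra correct sample on a previously unsolved problem raises Pass@k by a full problem but raises Average@k by only 1/k of a problem. That is consistent with the Pass@32 gains in the abstract being much larger than the Average@32 gains.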
