Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team
June 17, 2025
Authors: Md Tanzib Hosain, Salman Rahman, Md Kishor Morol, Md Rizwan Parvez
cs.AI
Abstract
Despite impressive progress on complex reasoning, current large language
models (LLMs) typically operate in isolation - treating each problem as an
independent attempt, without accumulating or integrating experiential
knowledge. In contrast, expert problem solvers - such as Olympiad or
programming contest teams - leverage a rich tapestry of experiences: absorbing
mentorship from coaches, developing intuition from past problems, leveraging
knowledge of tool usage and library functionality, adapting strategies based on
the expertise and experiences of peers, continuously refining their reasoning
through trial and error, and learning from other related problems even during
competition. We introduce Xolver, a training-free multi-agent reasoning
framework that equips a black-box LLM with a persistent, evolving memory of
holistic experience. Xolver integrates diverse experience modalities, including
external and self-retrieval, tool use, collaborative interactions, agent-driven
evaluation, and iterative refinement. By learning from relevant strategies,
code fragments, and abstract reasoning patterns at inference time, Xolver
avoids generating solutions from scratch - marking a transition from isolated
inference toward experience-aware language agents. Built on both open-weight
and proprietary models, Xolver consistently outperforms specialized reasoning
agents. Even with lightweight backbones (e.g., QwQ-32B), it often surpasses
advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high.
With o3-mini-high, it achieves new best results on GSM8K (98.1%), AIME'24
(94.4%), AIME'25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%) -
highlighting holistic experience learning as a key step toward generalist
agents capable of expert-level reasoning. Code and data are available at
https://kagnlp.github.io/xolver.github.io/.
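The inference-time loop the abstract describes (retrieve prior experience, draft a solution, score it with an agent-driven judge, refine over several rounds, and persist the result into an evolving memory) can be sketched in Python. This is a minimal, hypothetical skeleton, not Xolver's actual API: the names `ExperienceMemory`, `Episode`, and `solve`, the word-overlap retrieval, and the pluggable `generate`/`judge` callables are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One solved problem kept as reusable experience."""
    problem: str
    solution: str
    score: float

@dataclass
class ExperienceMemory:
    """Persistent store that grows across problems (the 'evolving memory')."""
    episodes: list = field(default_factory=list)

    def retrieve(self, problem: str, k: int = 2) -> list:
        # Toy relevance signal: count of shared words with past problems.
        # A real system would use embedding or retriever-based similarity.
        def overlap(e: Episode) -> int:
            return len(set(problem.split()) & set(e.problem.split()))
        return sorted(self.episodes, key=overlap, reverse=True)[:k]

    def store(self, episode: Episode) -> None:
        self.episodes.append(episode)

def solve(problem, memory, generate, judge, max_rounds=3) -> Episode:
    """Retrieve experience, then draft / judge / refine, then persist.

    `generate(problem, context, best_so_far)` and `judge(problem, draft)`
    stand in for LLM agent calls; `judge` returns a score in [0, 1].
    """
    context = memory.retrieve(problem)   # learn from related past problems
    best = None
    for _ in range(max_rounds):          # iterative refinement
        draft = generate(problem, context, best)
        score = judge(problem, draft)    # agent-driven evaluation
        if best is None or score > best.score:
            best = Episode(problem, draft, score)
        if best.score >= 1.0:            # judge is fully satisfied
            break
    memory.store(best)                   # experience carries to the next problem
    return best
```

With stub `generate`/`judge` functions in place of LLM agents, each call to `solve` both answers the current problem and enlarges the memory that later calls retrieve from, which is the shift from isolated inference to experience-aware solving that the abstract emphasizes.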