A^2FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning
October 13, 2025
Authors: Qianben Chen, Jingyi Cao, Jiayu Zhang, Tianrui Qin, Xiaowan Li, King Zhu, Dingfeng Shi, He Zhu, Minghao Liu, Xiaobo Liang, Xin Gui, Ge Zhang, Jian Yang, Yuchen Eleanor Jiang, Wangchunshu Zhou
cs.AI
Abstract
Large language models split into two families: reasoning-centric LLMs, which
strengthen internal chain-of-thought reasoning but cannot invoke external
tools, and agentic LLMs, which learn to interact with environments and leverage
tools but often lag in deep reasoning. This divide arises from fundamentally
different training objectives, leading to mismatched strengths and inefficiency
on simple queries, where both families tend to overthink or over-call tools. In
this work, we present Adaptive Agent Foundation Model (A^2FM), a unified
framework that follows a route-then-align principle: the model first learns
task-aware routing and then aligns mode-specific trajectories under a shared
backbone. To address the inefficiency gap, we introduce a third
mode-instant-that handles simple queries directly, preventing unnecessary
reasoning or tool calls while complementing the agentic and reasoning modes. To
jointly enhance accuracy and efficiency, we propose Adaptive Policy
Optimization (APO), which enforces adaptive sampling across modes and applies a
cost-regularized reward. On the 32B scale, A^2FM achieves 13.4% on
BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among
comparable models and performing competitively with frontier LLMs across
agentic, reasoning, and general benchmarks. Notably, the adaptive execution
achieves a cost-of-pass of only $0.00487 per correct answer, cutting cost by
45.2% relative to the reasoning mode and 33.5% relative to the agentic mode,
thus delivering substantially higher cost efficiency while maintaining
comparable accuracy.
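The stated reductions let the single-mode costs be recovered by back-of-envelope arithmetic, assuming each percentage is relative to that mode's own cost-of-pass:

```latex
\begin{align*}
C_{\text{reasoning}} &\approx \frac{\$0.00487}{1 - 0.452} \approx \$0.00889,\\
C_{\text{agentic}}   &\approx \frac{\$0.00487}{1 - 0.335} \approx \$0.00732.
\end{align*}
```

That is, adaptive routing answers correctly at roughly half the per-answer cost of always reasoning and about two-thirds the cost of always acting agentically.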
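The abstract does not spell out APO's objective. Below is a minimal sketch of what a cost-regularized reward combined with adaptive per-mode sampling could look like; the functional form, the cost normalization, and every name (`cost_regularized_reward`, `adaptive_sample_mode`, `lambda_cost`, `cost_scale`) are illustrative assumptions, not the paper's equations.

```python
import random

MODES = ["instant", "reasoning", "agentic"]

def cost_regularized_reward(correct: bool, cost_usd: float,
                            lambda_cost: float = 0.5,
                            cost_scale: float = 0.01) -> float:
    """Reward = task correctness minus a penalty proportional to execution cost.

    A hypothetical instance of a cost-regularized reward: a correct but
    expensive trajectory scores lower than an equally correct cheap one.
    """
    accuracy = 1.0 if correct else 0.0
    return accuracy - lambda_cost * (cost_usd / cost_scale)

def adaptive_sample_mode(mode_success_rate: dict) -> str:
    """Sample a mode, shifting probability toward under-performing modes so
    every mode keeps receiving training signal (adaptive sampling)."""
    # Weight each mode by how far it is from solved (1 - success rate).
    weights = [1.0 - mode_success_rate.get(m, 0.0) + 1e-3 for m in MODES]
    total = sum(weights)
    return random.choices(MODES, weights=[w / total for w in weights])[0]

if __name__ == "__main__":
    # A correct instant-mode answer costing $0.001 outscores a correct
    # agentic rollout costing $0.008, steering the policy toward the
    # cheapest mode that still solves the task.
    print(cost_regularized_reward(correct=True, cost_usd=0.001))  # 0.95
    print(cost_regularized_reward(correct=True, cost_usd=0.008))  # 0.60
    print(adaptive_sample_mode({"instant": 0.9, "reasoning": 0.6, "agentic": 0.3}))
```

Under this toy reward, the policy is pushed toward the instant mode on queries it can already answer cheaply, which is the efficiency behavior the abstract attributes to APO.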