Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
April 9, 2026
Authors: Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou
cs.AI
Abstract
The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
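To make the two-channel idea concrete, here is a minimal sketch of what conditional advantage estimation could look like over a group of rollouts. This is an illustrative reconstruction, not the paper's implementation: the function name `hdpo_advantages`, the GRPO-style group normalization, the use of negative tool-call counts as the efficiency cost, and the unweighted sum of the two channels are all assumptions.

```python
import numpy as np

def hdpo_advantages(rewards_correct, tool_calls, eps=1e-8):
    """Hypothetical sketch of a two-channel (accuracy + efficiency)
    advantage estimate over one group of sampled rollouts.

    rewards_correct: binary accuracy rewards per rollout (1 = correct).
    tool_calls: number of tool invocations per rollout.
    """
    r = np.asarray(rewards_correct, dtype=float)
    t = np.asarray(tool_calls, dtype=float)

    # Accuracy channel: standard group-normalized advantage.
    adv_acc = (r - r.mean()) / (r.std() + eps)

    # Efficiency channel: normalized ONLY within correct trajectories,
    # so the economy signal never competes with the accuracy reward
    # (incorrect rollouts receive zero efficiency advantage).
    adv_eff = np.zeros_like(r)
    correct = r > 0
    if correct.sum() > 1:
        cost = -t[correct]  # fewer tool calls -> higher efficiency score
        adv_eff[correct] = (cost - cost.mean()) / (cost.std() + eps)

    # Assumed combination; the paper may weight or apply the channels differently.
    return adv_acc + adv_eff
```

In this sketch, a correct rollout that answered without tools outranks an equally correct rollout that invoked tools, while incorrect rollouts are penalized on accuracy alone, which is how the decoupling avoids the "mild penalty drowned by accuracy variance" failure mode the abstract describes.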