LLM Agents Already Know When to Call Tools -- Even Without Reasoning
May 10, 2026
Authors: Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng
cs.AI
Abstract
Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and adds latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss; at comparable accuracy, the best baseline reduces tool calls by only 6%, or achieves a similar reduction at a 5x higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
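The core mechanism the abstract describes -- fitting a lightweight linear probe on pre-generation hidden states and using its verdict to prefill a steering sentence -- can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation: the hidden states, the probe-training recipe, the decision threshold, and the steering sentences are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for pre-generation hidden states: tool-necessary
# tasks are shifted along one direction in representation space.
d = 32
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X_no_tool = rng.normal(size=(200, d))                     # label 0
X_tool = rng.normal(size=(200, d)) + 2.5 * direction      # label 1
X = np.vstack([X_no_tool, X_tool])
y = np.array([0] * 200 + [1] * 200)

# Lightweight linear probe: logistic regression trained by
# plain gradient descent (no external ML library needed).
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

# AUROC of the probe's scores: fraction of (positive, negative)
# pairs the probe ranks correctly.
scores = X @ w + b
pos, neg = scores[y == 1], scores[y == 0]
auroc = (pos[:, None] > neg[None, :]).mean()

def prefill(score, threshold=0.0):
    """Map the probe's verdict to a steering sentence that would be
    prepended to the model's response (sentences are illustrative)."""
    if score > threshold:
        return "This task requires a tool; I will call one."
    return "I can answer this directly without any tool."
```

On this toy data the probe separates the two classes well (AUROC near the 0.89-0.96 range the paper reports for real models), and `prefill` shows how a scalar probe score is turned into the generation-steering prefix: the expensive part (reading one hidden state and applying a linear map) happens once, before any tokens are generated.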