LLM Agents Already Know When to Call Tools -- Even Without Reasoning
May 10, 2026
Authors: Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng
cs.AI
Abstract
Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and adds latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only a 1.7% accuracy loss, while the best baseline at comparable accuracy reduces tool calls by only 6%, or achieves a similar tool-call reduction but incurs a 5x higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool.
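The abstract describes Probe&Prefill only at a high level: a linear probe reads the pre-generation hidden state, and a steering sentence is prefilled into the response before generation. Below is a minimal sketch of that two-step flow, assuming a HuggingFace-style causal LM. The model name, the choice of probed layer (final layer, last prompt token), the scikit-learn probe, and the steering-sentence wording are all illustrative assumptions rather than the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch of the Probe&Prefill idea. Assumptions (not from the paper):
# a HuggingFace causal LM, a probe trained with scikit-learn on the last
# prompt token's final-layer hidden state, and illustrative steering sentences.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any chat model works
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def pre_generation_state(prompt: str) -> torch.Tensor:
    """Hidden state of the last prompt token, before any token is generated."""
    ids = tok(prompt, return_tensors="pt").to(lm.device)
    with torch.no_grad():
        out = lm(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float().cpu()  # final layer, last token

def train_probe(prompts, labels):
    """Fit a lightweight linear probe on labeled (prompt, needs_tool) pairs."""
    X = torch.stack([pre_generation_state(p) for p in prompts]).numpy()
    return LogisticRegression(max_iter=1000).fit(X, labels)

# Illustrative steering sentences prefilled into the model's response.
STEER_NO_TOOL = "I can answer this directly without calling any tool. "
STEER_TOOL = "This question requires a tool call before I can answer. "

def probe_and_prefill(probe, prompt: str, threshold: float = 0.5) -> str:
    """Read the hidden-state signal, then prefill the matching steering sentence."""
    h = pre_generation_state(prompt).unsqueeze(0).numpy()
    needs_tool = probe.predict_proba(h)[0, 1] >= threshold
    prefill = STEER_TOOL if needs_tool else STEER_NO_TOOL
    ids = tok(prompt + prefill, return_tensors="pt").to(lm.device)
    gen = lm.generate(**ids, max_new_tokens=256)
    continuation = tok.decode(gen[0, ids.input_ids.shape[1]:], skip_special_tokens=True)
    return prefill + continuation
```

In this setup the probe is trained once on labeled tool-necessary vs. tool-unnecessary tasks, and at inference it adds only one extra forward pass plus a single linear read-out, consistent with the abstract's description of the probe as lightweight and training-free with respect to the base model.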