LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Abstract

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and adds latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, whereas the best baseline either reduces tool calls by only 6% at comparable accuracy or achieves a similar reduction in tool calls at a 5× higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
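The probe-then-prefill idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes hidden states are already available as fixed-length vectors and substitutes synthetic Gaussian vectors for them. The steering sentences, the 0.5 threshold, and all function names are hypothetical choices made for illustration only.

```python
import math
import random

# Hypothetical steering sentences; the abstract does not give the actual wording.
PREFILL_DIRECT = "I can answer this directly without calling a tool."
PREFILL_TOOL = "I should call a tool for this task."


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, state):
    """Linear probe: estimated probability that a tool is needed."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, state)) + b)


def train_linear_probe(states, labels, lr=0.1, epochs=200):
    """Fit logistic regression by batch gradient descent on the log loss."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = probe_score(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def choose_prefill(w, b, state, threshold=0.5):
    """Pick the sentence to prefill into the response before generation starts."""
    return PREFILL_TOOL if probe_score(w, b, state) >= threshold else PREFILL_DIRECT


def synthetic_states(n, dim, rng):
    """Toy stand-in for hidden states: the label shifts the first two coordinates."""
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 1.0 if y == 1 else -1.0
        states.append([rng.gauss(shift if i < 2 else 0.0, 1.0) for i in range(dim)])
        labels.append(y)
    return states, labels


rng = random.Random(0)
train_x, train_y = synthetic_states(200, 8, rng)
test_x, test_y = synthetic_states(200, 8, rng)
w, b = train_linear_probe(train_x, train_y)
test_auroc = auroc([probe_score(w, b, x) for x in test_x], test_y)
print(f"held-out AUROC on toy data: {test_auroc:.2f}")
print(choose_prefill(w, b, test_x[0]))
```

In a real setting, the vectors would be extracted from the model before the first response token is generated, and the chosen sentence would be placed at the start of the response so that generation continues from it. The toy AUROC printed here says nothing about the figures reported in the abstract.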

