Model-Adaptive Tool Necessity Reveals a Knowing-Doing Gap in LLM Tool Use

Abstract

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly and when to invoke external tools. Prior work on adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced because capability boundaries differ across models: a problem that a strong model can solve on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool call. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples through the two-stage process, we further find that the majority of mismatches are concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.
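
To make the two quantities in the abstract concrete, the following is a minimal sketch rather than the paper's implementation. It assumes, hypothetically, that a tool counts as necessary for a given model when its no-tool accuracy over repeated attempts falls below a threshold; the abstract says only that the definition is grounded in each model's empirical performance. The names `Sample`, `tool_necessary`, `mismatch_rate`, the 0.5 threshold, and the toy data are all illustrative assumptions. The `cosine` helper shows how the near-orthogonality of two probe directions could be measured once such directions are available.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    # Hypothetical record for one question and one model.
    no_tool_correct: List[bool]  # outcomes of repeated attempts without tools
    called_tool: bool            # observed behavior when a tool is offered


def tool_necessary(sample: Sample, threshold: float = 0.5) -> bool:
    """Assumed rule: a tool is necessary if no-tool accuracy is below threshold."""
    accuracy = sum(sample.no_tool_correct) / len(sample.no_tool_correct)
    return accuracy < threshold


def mismatch_rate(samples: List[Sample], threshold: float = 0.5) -> float:
    """Fraction of samples where observed tool calling disagrees with necessity."""
    mismatches = sum(
        tool_necessary(s, threshold) != s.called_tool for s in samples
    )
    return mismatches / len(samples)


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two probe weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy data, invented purely for illustration.
samples = [
    Sample([True, True, True, False], called_tool=True),     # overuse
    Sample([False, False, True, False], called_tool=False),  # underuse
    Sample([False, False, False, False], called_tool=True),  # consistent
    Sample([True, True, True, True], called_tool=False),     # consistent
]

print(mismatch_rate(samples))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```

Under this assumed rule, the same question can be labeled differently for different models, which is the sense in which necessity is model-adaptive. A cosine near zero between the cognition and execution probe directions would correspond to the near-orthogonality described above.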