Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

Abstract

Large language models (LLMs) increasingly act as autonomous agents that must decide whether to answer directly or to invoke external tools. Prior work on adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced because capability boundaries differ across models: a problem that a strong model can solve on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose this failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool call. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples through the two-stage process, we further find that the majority of mismatches are concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.
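
To make the two measured quantities concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical rule in which a tool counts as necessary for a given model when that model's accuracy without tools falls below a threshold; the abstract states only that necessity is grounded in each model's empirical performance. The names `Sample`, `tool_needed`, `mismatch_rate`, and `cosine`, the 0.5 threshold, and the toy data are all illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    # Outcomes of repeated attempts by one model WITHOUT tools (hypothetical data).
    no_tool_correct: List[bool]
    # Whether the same model invoked a tool when a tool was available.
    called_tool: bool


def tool_needed(sample: Sample, threshold: float = 0.5) -> bool:
    """Assumed rule: a tool is necessary if no-tool accuracy is below the threshold."""
    accuracy = sum(sample.no_tool_correct) / len(sample.no_tool_correct)
    return accuracy < threshold


def mismatch_rate(samples: List[Sample], threshold: float = 0.5) -> float:
    """Fraction of samples where tool-call behavior disagrees with necessity."""
    mismatches = sum(tool_needed(s, threshold) != s.called_tool for s in samples)
    return mismatches / len(samples)


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two probe weight vectors; near 0 means near-orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


samples = [
    Sample([True, True, True, False], called_tool=True),      # not needed, but called
    Sample([False, False, True, False], called_tool=False),   # needed, but not called
    Sample([False, False, False, False], called_tool=True),   # needed and called
    Sample([True, True, True, True], called_tool=False),      # not needed, not called
]

print(mismatch_rate(samples))  # 0.5
```

Under this assumed rule, the same question can be labeled differently for different models, which is what distinguishes a model-adaptive definition from a fixed human or LLM-judge annotation. The `cosine` helper illustrates how the angle between a cognition probe direction and an action probe direction could be compared, given weight vectors from two separately trained linear probes.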