Model-Adaptive Tool Necessity Reveals a Knowing-Doing Gap in Large Language Model Tool Use

Abstract

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly and when to invoke external tools. Prior work on adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced because capability boundaries differ across models: a problem that a strong model can solve on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose this failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool call. By probing LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples through the two-stage process, we further find that the majority of mismatches are concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.
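
As a rough illustration of the two quantities mentioned in the abstract, the sketch below computes a mismatch rate between model-adaptive necessity and tool-call behavior, and the cosine similarity between two probe directions. It is a minimal sketch under stated assumptions, not the paper's procedure: the abstract does not specify how necessity is derived from empirical performance, so the rule used here (tool-free accuracy below a threshold) and the names `Sample`, `tool_necessary`, `mismatch_rate`, and `cosine` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    # Hypothetical record: correctness of several tool-free attempts by one model,
    # plus whether that model called a tool when one was available.
    no_tool_correct: List[bool]
    called_tool: bool


def tool_necessary(sample: Sample, threshold: float = 0.5) -> bool:
    """Assumed rule: a tool is necessary if tool-free accuracy is below threshold."""
    accuracy = sum(sample.no_tool_correct) / len(sample.no_tool_correct)
    return accuracy < threshold


def mismatch_rate(samples: Sequence[Sample], threshold: float = 0.5) -> float:
    """Fraction of samples where necessity and actual tool-call behavior disagree."""
    mismatches = sum(tool_necessary(s, threshold) != s.called_tool for s in samples)
    return mismatches / len(samples)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two probe weight vectors (0 means orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy data, invented purely for illustration.
samples = [
    Sample([True, True, True], called_tool=False),    # not needed, not called
    Sample([False, False, True], called_tool=False),  # needed, not called (mismatch)
    Sample([False, False, False], called_tool=True),  # needed, called
    Sample([True, True, False], called_tool=True),    # not needed, called (mismatch)
]
print(mismatch_rate(samples))
print(cosine([1.0, 0.0], [0.0, 2.0]))
```

Because necessity is computed per model from its own tool-free results, the same question can be labeled as needing a tool for one model and not for another, which is the point of a model-adaptive definition.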