Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

Abstract

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly and when to invoke external tools. Prior work on adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced because capability boundaries differ across models: a problem that a strong model can solve on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare necessity against observed tool-call behavior for four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually issues a tool call. By probing LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples through the two-stage process, we further find that the majority of the mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.
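
To make the quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "tool necessity" for a given model is decided by thresholding that model's no-tool accuracy over repeated attempts. The abstract does not specify this rule, and the names `Item`, `tool_necessary`, `mismatch_rate`, and the `threshold` parameter are hypothetical. The `cosine` helper illustrates how two probe directions could be compared, with values near zero indicating near-orthogonality.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Item:
    # Hypothetical record for one problem and one model.
    no_tool_correct: List[bool]  # outcomes of repeated attempts without tools
    called_tool: bool            # whether the model actually issued a tool call


def tool_necessary(item: Item, threshold: float = 0.5) -> bool:
    """Assumed rule: a tool is necessary if no-tool accuracy is below threshold."""
    accuracy = sum(item.no_tool_correct) / len(item.no_tool_correct)
    return accuracy < threshold


def mismatch_rate(items: List[Item], threshold: float = 0.5) -> float:
    """Fraction of items where observed tool calling disagrees with necessity."""
    mismatches = sum(tool_necessary(it, threshold) != it.called_tool for it in items)
    return mismatches / len(items)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two probe direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy data, invented purely for illustration.
items = [
    Item([True, True, True], called_tool=False),     # not needed, not called
    Item([False, False, True], called_tool=False),   # needed, not called
    Item([True, True, False], called_tool=True),     # not needed, called
    Item([False, False, False], called_tool=True),   # needed, called
]
rate = mismatch_rate(items)
similarity = cosine([1.0, 0.0], [0.0, 1.0])
```

Under this assumed rule, the same problem can be labeled "tool necessary" for a weaker model and "not necessary" for a stronger one, which is the model-adaptive aspect the abstract describes.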