ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Abstract

Large language models (LLMs) deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Embedding-based retrieval relies on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary and fine-tuning the LLM in two stages (memorization, then retrieval SFT) so that the LLM itself acts as the retriever; this approach achieves strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths. Neither property reveals whether the model actually understands its tools. We introduce ToolSense, an open-source, LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, a multiple-choice (MCQ) probing benchmark, and a question-answering (QA) probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by roughly 50 to 64 percentage points relative to the fully specified ToolBench benchmarks, falling below the embedding-model baseline. In addition, some models score near random on factual probes despite strong retrieval performance, which is further evidence of this dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.
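To illustrate why constrained decoding can hide gaps in tool knowledge, the following is a minimal sketch of greedy decoding restricted to a prefix tree of valid token paths. It is not the evaluation code of ToolSense or of any parametric retriever; the token ids, the `END` sentinel, the helper names (`build_trie`, `constrained_greedy`), and the toy scoring function are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence

END = -1  # hypothetical sentinel marking the end of a valid tool path


def build_trie(paths: Sequence[Sequence[int]]) -> dict:
    """Build a nested-dict prefix tree over the valid tool token paths."""
    root: dict = {}
    for path in paths:
        node = root
        for tok in path:
            node = node.setdefault(tok, {})
        node[END] = {}  # mark that a complete tool path ends here
    return root


def constrained_greedy(
    score_fn: Callable[[List[int]], Dict[int, float]],
    trie: dict,
    max_len: int = 8,
) -> List[int]:
    """Greedy decoding that may only follow edges present in the trie."""
    prefix: List[int] = []
    node = trie
    for _ in range(max_len):
        allowed = list(node.keys())
        if not allowed:
            break
        scores = score_fn(prefix)
        # Tokens outside `allowed` are never considered, however high they score.
        best = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        if best == END:
            break
        prefix.append(best)
        node = node[best]
    return prefix


def toy_scores(prefix: List[int]) -> Dict[int, float]:
    """Stand-in for a model: its top choice at each step is NOT a valid tool token."""
    if not prefix:
        return {999: 5.0, 101: 1.0, 205: 0.5}
    return {3: 4.0, 9: 0.2, 7: 0.1}


# Three hypothetical tools, each identified by a short token path.
trie = build_trie([[101, 7], [101, 9], [205]])

first_scores = toy_scores([])
unconstrained_first = max(first_scores, key=first_scores.get)  # free choice
constrained_path = constrained_greedy(toy_scores, trie)        # forced onto the trie
```

In this toy setting the unconstrained first choice is an invalid token, while the constrained decoder still returns a well-formed tool path. This is the sense in which a constrained evaluation can report a valid tool even when the model's unconstrained preference lies elsewhere.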