ToolSense: A Diagnostic Framework for Evaluating the Parametric Tool Knowledge of LLMs

Abstract

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Embedding-based retrieval relies on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary and fine-tuning the LLM in two stages (memorization, then retrieval SFT) so that the LLM itself acts as the retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths, so neither the queries nor the evaluation protocol reveals whether the model actually understands its tools. We introduce ToolSense, an open-source, LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (approximately 47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation. On RRB queries, several configurations drop by roughly 50 to 64 percentage points relative to the fully specified ToolBench benchmarks, falling below the embedding-model baseline. In addition, despite strong retrieval performance, some models score near-random on factual probes. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.
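To make the concern about constrained decoding concrete, the following is a minimal sketch, not the ToolSense or ToolBench evaluation code. It assumes each tool corresponds to a path of tokens stored in a trie and that the model exposes a per-step score for each candidate token. The names `build_trie`, `constrained_greedy_decode`, `toy_scores`, the `<end>` marker, and the example tool paths are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence

END = "<end>"  # hypothetical marker for the end of a valid tool path


def build_trie(paths: Sequence[Sequence[str]]) -> dict:
    """Build a nested-dict trie over the valid tool token paths."""
    root: dict = {}
    for path in paths:
        node = root
        for token in path:
            node = node.setdefault(token, {})
        node[END] = {}  # mark that a complete tool path ends here
    return root


def constrained_greedy_decode(
    score_fn: Callable[[List[str]], Dict[str, float]], trie: dict
) -> List[str]:
    """Greedy decoding in which only trie-allowed tokens may be chosen."""
    if not trie:
        raise ValueError("the trie must contain at least one tool path")
    node, prefix = trie, []
    while True:
        scores = score_fn(prefix)
        # Tokens outside the trie are never considered, however high they score.
        best = max(node, key=lambda tok: scores.get(tok, float("-inf")))
        if best == END:
            return prefix
        prefix.append(best)
        node = node[best]


# Hypothetical catalog of three tools, each a two-token path.
paths = [["weather", "forecast"], ["weather", "alerts"], ["stocks", "quote"]]


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Toy model whose strongest preference is a token that is not a tool."""
    return {"banana": 9.0, END: 0.5, "quote": 0.3, "stocks": 0.2,
            "weather": 0.1, "forecast": 0.0, "alerts": 0.0}


decoded = constrained_greedy_decode(toy_scores, build_trie(paths))
unconstrained_top = max(toy_scores([]), key=toy_scores([]).get)
print(decoded, unconstrained_top)
```

In this toy setup the constrained decoder always returns a well-formed tool path even though the model's top unconstrained choice is not a tool at all, which illustrates why a well-formed output under constrained decoding does not by itself show that the model understands its tools.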