ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Abstract

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Embedding-based retrieval relies on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by using the LLM itself as the retriever: each tool is encoded as a virtual token appended to the LLM vocabulary, and the model is fine-tuned in two stages (memorization, then retrieval SFT). This approach achieves strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths, so they do not reveal whether the model actually understands its tools. We introduce ToolSense, an open-source, LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, a multiple-choice (MCQ) probing benchmark, and a question-answering (QA) probing benchmark. We apply ToolSense to ToolBench (~47k tools) and evaluate five parametric model training configurations. On RRB queries, several configurations drop by roughly 50 to 64 percentage points relative to the fully specified ToolBench benchmarks, falling below the embedding-model baseline. Moreover, despite strong retrieval performance, some models score near-random on factual probes, suggesting a knowledge-retrieval dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.
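To make the point about constrained decoding concrete, the following is a minimal sketch of greedy decoding restricted to valid tool-token paths. It is not the ToolSense or ToolBench implementation: the two-level token paths, the token names, and the fixed `scores` table standing in for model logits are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def build_trie(paths: Sequence[Sequence[str]]) -> dict:
    """Build a nested-dict prefix tree over the valid tool-token paths."""
    root: dict = {}
    for path in paths:
        node = root
        for tok in path:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(trie: dict, scores: Dict[str, float]) -> List[str]:
    """Greedy decoding that only considers tokens the trie allows at each step.

    `scores` is a fixed token -> score table, a stand-in for per-step logits.
    """
    out: List[str] = []
    node = trie
    while node:  # stop once a leaf (empty dict) is reached
        tok = max(node, key=lambda t: scores.get(t, float("-inf")))
        out.append(tok)
        node = node[tok]
    return out


# Hypothetical catalog: each tool is a path of virtual tokens.
valid_paths = [
    ["<cat_weather>", "<tool_forecast>"],
    ["<cat_weather>", "<tool_alerts>"],
    ["<cat_finance>", "<tool_quotes>"],
]

# Hypothetical scores: the model's top choice ("the") is not a tool token at all.
scores = {
    "the": 5.0,
    "<cat_finance>": 0.2,
    "<cat_weather>": 0.1,
    "<tool_quotes>": 0.3,
    "<tool_forecast>": 0.05,
    "<tool_alerts>": 0.01,
}

decoded = constrained_greedy_decode(build_trie(valid_paths), scores)
print(decoded)
```

In this toy setup the highest-scoring token is never emitted because it lies outside the trie, so the output is always a well-formed tool identifier regardless of how weak the underlying preference is. That is the sense in which constrained evaluation alone may not show whether a model understands its tools.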