CoHyDE: Iterative Co-Training of an LLM Rewriter and a Dense Encoder for Tool Retrieval

Abstract

Tool retrieval over large API catalogs is a core bottleneck for LLM agents. User queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary, and a fixed encoder cannot bridge that gap on its own. The two dominant approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, attack the problem from opposite ends and fail in complementary ways. The fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not. Zero-shot HyDE is more robust to underspecified queries, yet it generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system. The encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores; both components are warm-started on the tool catalog before the loop begins. On a subset of roughly 10k tools from the ToolBench catalog, three rounds of CoHyDE improve NDCG@5 over the strongest single-component baseline by +2.5 pp on standard queries and +6.3 pp on held-out vague queries, with gains of up to +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: either component used in isolation falls short of CoHyDE on both well-formed and vague queries, by as much as 8 pp on vague queries.
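The two training signals named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes in-batch negatives and a temperature of 0.05 for InfoNCE, and it assumes that DPO pairs are formed by ranking candidate rewrites by the encoder's similarity to the gold tool. The names `info_nce`, `preference_pair`, `embed`, and `gold_tool_vec` are hypothetical and chosen only for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries, tools, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (an assumed setup).

    queries[i] is the embedding of a hypothetical description whose
    positive is tools[i]; every other tools[j] acts as a negative.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [dot(q, t) / temperature for t in tools]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])  # -log softmax of the positive
    return sum(losses) / len(losses)


def preference_pair(candidates, embed, gold_tool_vec):
    """Build one (chosen, rejected) pair from candidate rewrites.

    Assumption: the encoder's similarity to the gold tool is the
    retrieval score used to rank rewrites. Returns None when all
    candidates tie, since a tie carries no preference signal.
    """
    scores = [(dot(embed(c), gold_tool_vec), c) for c in candidates]
    best = max(scores, key=lambda s: s[0])
    worst = min(scores, key=lambda s: s[0])
    if best[0] == worst[0]:
        return None
    return best[1], worst[1]
```

In a loop of this shape, the pairs returned by `preference_pair` would feed a DPO update of the rewriter, and the rewriter's new descriptions would feed the next `info_nce` update of the encoder. How the paper actually samples candidates and negatives is not stated in the abstract.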