CoHyDE: Iterative Co-Training of an LLM Rewriter and a Dense Encoder for Tool Retrieval

Abstract

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary, a gap that no fixed encoder can bridge on its own. The two dominant approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, attack this problem from opposite ends and fail in complementary ways: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval on well-formed queries. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO using the encoder's retrieval scores as the preference signal. Both components are warm-started on the tool catalog before the loop begins. On a subset of roughly 10k tools from the ToolBench catalog, three rounds of CoHyDE improve NDCG@5 over the strongest single-component baseline by +2.5 pp on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: neither component in isolation matches CoHyDE on both well-formed and vague queries, with losses of up to 8 pp on vague queries.
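
The two training signals named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature value, the use of in-batch negatives, and the rule of pairing the best-scoring rewrite against the worst-scoring one are all assumptions made for illustration. `info_nce` computes the encoder loss over a similarity matrix whose diagonal holds the gold tool for each hypothetical description, and `preference_pair` turns per-candidate encoder retrieval scores into one (chosen, rejected) pair of the kind DPO consumes.

```python
import math


def info_nce(sim, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    sim[i][j] is the similarity between hypothetical description i and tool j.
    Assumption: the gold tool for description i sits on the diagonal, and the
    other tools in the row act as in-batch negatives.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # negative log-softmax of the positive
    return total / len(sim)


def preference_pair(candidates, scores, min_margin=0.0):
    """Build one (chosen, rejected) pair for a single query.

    candidates: rewrites sampled from the rewriter (hypothetical inputs).
    scores: one encoder retrieval score per candidate, higher is better
            (for example the reciprocal rank of the gold tool; an assumption).
    Returns None when best and worst are not separated by more than min_margin.
    """
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    (best_score, chosen), (worst_score, rejected) = ranked[0], ranked[-1]
    if best_score - worst_score <= min_margin:
        return None
    return chosen, rejected
```

In a loop of the kind the abstract describes, one round would presumably sample several rewrites per query, score them with the current encoder, collect pairs with `preference_pair` for the DPO step, and then retrain the encoder with a loss like `info_nce` on the updated rewriter's outputs. The margin filter is a design choice of this sketch: it drops queries whose candidates the encoder cannot tell apart, since such pairs carry little preference signal.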