CoHyDE: Iterative Co-Training of an LLM Rewriter and a Dense Encoder for Tool Retrieval

Abstract

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary, a gap that no fixed encoder can bridge on its own. The two dominant approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, attack this problem from opposite ends and fail in complementary ways. The fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not. Zero-shot HyDE is more robust to underspecified queries, yet it generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a subset of roughly 10k tools from the ToolBench catalog, three rounds of CoHyDE improve NDCG@5 over the strongest single-component baseline by 2.5 pp on standard queries and by 6.3 pp on held-out vague queries, with gains as large as 8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: neither component used in isolation matches CoHyDE on both well-formed and vague queries, and each falls short by as much as 8 pp on vague queries.
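The two training signals named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the default `temperature` and `beta` values, the use of in-batch negatives, and the best-versus-worst rule for forming preference pairs are all assumptions made for illustration, since the abstract does not specify them. Only the InfoNCE and DPO loss forms themselves are standard.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(rewrite_vecs, tool_vecs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    Assumption: rewrite_vecs[i] embeds a hypothetical description whose
    gold tool is tool_vecs[i]; every other tool in the batch is a negative.
    """
    total = 0.0
    for i, q in enumerate(rewrite_vecs):
        logits = [dot(q, t) / temperature for t in tool_vecs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # equals -log softmax(logits)[i]
    return total / len(rewrite_vecs)


def build_preference_pair(rewrites, retrieval_scores):
    """Return (chosen, rejected) rewrites ranked by the encoder's score.

    Assumption: the highest-scoring rewrite is preferred over the
    lowest-scoring one; returns None when the scores give no signal.
    """
    best = max(range(len(rewrites)), key=lambda i: retrieval_scores[i])
    worst = min(range(len(rewrites)), key=lambda i: retrieval_scores[i])
    if retrieval_scores[best] == retrieval_scores[worst]:
        return None
    return rewrites[best], rewrites[worst]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin).

    Inputs are sequence log-probabilities under the policy being trained
    and under the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(x)) = softplus(-x), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Under these assumptions, one round would sample several rewrites per query, score each with the current encoder, feed the resulting pairs to `dpo_loss` to update the rewriter, and then retrain the encoder with `info_nce_loss` on the updated rewriter's outputs.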