CoHyDE: Iterative Co-Training of an LLM Rewriter and a Dense Encoder for Tool Retrieval

Abstract

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary, a gap that a fixed encoder cannot bridge on its own. The two dominant approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, attack this problem from opposite ends and fail in complementary ways: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k-tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: either component used in isolation falls short of CoHyDE on both well-formed and vague queries, by up to 8 pp on vague queries.
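
The two objectives named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature and beta defaults, the use of in-batch negatives, and the rule for building a preference pair from retrieval scores (best-scoring rewrite as chosen, worst as rejected) are all assumptions made for illustration.

```python
import numpy as np


def info_nce_loss(desc_emb, tool_emb, temperature=0.05):
    """In-batch InfoNCE: row i of desc_emb is the positive for row i of tool_emb.

    The temperature value is an illustrative assumption.
    """
    # L2-normalise so the dot product is cosine similarity.
    d = desc_emb / np.linalg.norm(desc_emb, axis=1, keepdims=True)
    t = tool_emb / np.linalg.norm(tool_emb, axis=1, keepdims=True)
    logits = d @ t.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-probability of the matching (diagonal) tool.
    return float(-np.mean(np.diag(log_probs)))


def pick_preference_pair(rewrites, gold_scores):
    """Hypothetical pairing rule: chosen = rewrite whose embedding scores the
    gold tool highest under the encoder, rejected = the lowest-scoring one."""
    order = np.argsort(gold_scores)
    return rewrites[order[-1]], rewrites[order[0]]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair, given sequence log-probabilities under the
    trainable rewriter (policy) and a frozen reference rewriter."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written stably as log(1 + exp(-margin)).
    return float(np.logaddexp(0.0, -margin))
```

In a loop of the kind the abstract describes, one round would presumably sample several rewrites per query, score them with the current encoder, build pairs for the DPO step, and then retrain the encoder with the InfoNCE step on the updated rewriter's outputs.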