Beyond Retrieval: A Multitask Benchmark and Model for Code Search
May 6, 2026
Authors: Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu
cs.AI
Abstract
Code search is usually evaluated as first-stage retrieval, even though production systems rely on broader pipelines that include reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker; together they go beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and is delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (roughly 2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline achieving net-positive gains across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The dataset and model are publicly released.
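The evaluation metric referenced above, nDCG@10 over graded relevance judgments, can be sketched as follows. This is a minimal illustration of the standard formula (exponential gain, logarithmic rank discount), not code from the paper; the function name and inputs are hypothetical.

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for a single query.

    `relevances` lists the graded relevance labels of the ranked
    results, highest-ranked result first (e.g. 0 = irrelevant,
    3 = highly relevant).
    """
    def dcg(rels):
        # Gain 2^rel - 1, discounted by log2(rank + 1) with 1-based ranks.
        return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

    # Ideal DCG: the same labels sorted into the best possible order.
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

A perfectly ordered ranking scores 1.0, any misordering scores below 1.0, and a query with no relevant results is conventionally scored 0.0; averaging this value over all queries gives the per-task numbers an nDCG@10 benchmark reports.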