Do not copy and paste! Rewriting strategies for code retrieval

May 8, 2026
Authors: Andrea Gurioli, Federico Pennino, Maurizio Gabbrielli
cs.AI

Abstract

Embedding-based code retrieval often suffers when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and corpora into a normalized style, but leaves two questions open: how much representational shift helps, and when is the per-query LLM call justified? We study a hierarchy of three rewriting strategies: stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription, under joint query-corpus (QC, online) and corpus-only (C, offline) augmentation, across six CoIR benchmarks, five encoders, and three rewriters spanning independent model families (Qwen, DeepSeek, Mistral). We are the first to evaluate NL-enriched PseudoCode and snippet-level Natural Language as direct retrieval representations, rather than as transient intermediates. Full NL rewriting with QC yields the largest gains (+0.51 absolute NDCG@10 on CT-Contest for MoSE-18), while corpus-only rewriting degrades retrieval in 56 of 90 configurations (about 62%). We introduce two diagnostics, ΔH (token entropy) and Δs (embedding cosine), and show that ΔH predicts retrieval gain under QC across all three rewriter families: pooled Spearman ρ = +0.436 (p < 0.001) on DeepSeek+Codestral, ρ = +0.593 on Codestral alone, and ρ = +0.356 on Qwen. This establishes ΔH as a cheap, rewriter-agnostic proxy for deciding when rewriting pays off before running retrieval. Our analysis reframes LLM rewriting as a cost-benefit decision: it is most effective as a remediation layer for lightweight encoders on code-dominant queries, with diminishing returns for strong encoders or NL-heavy queries.
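
The abstract defines the two diagnostics only informally. The sketch below shows one plausible way to compute them for an (original, rewrite) snippet pair, under stated assumptions: whitespace tokenization stands in for the encoder's tokenizer, embeddings are precomputed NumPy vectors, and the sign convention for ΔH (rewrite minus original) is an illustrative choice, not the authors' released implementation.

```python
# Illustrative sketch of the two diagnostics named in the abstract:
#   Delta H: shift in Shannon token entropy from original to rewrite
#   Delta s: embedding cosine between original and rewritten snippet
# Tokenization, sign convention, and embedding handling are assumptions
# made for this sketch, not the paper's pipeline.
import math
from collections import Counter

import numpy as np


def token_entropy(text: str) -> float:
    """Shannon entropy (bits) of the empirical token distribution.

    Whitespace tokenization is a stand-in; the abstract does not
    specify which tokenizer the paper uses.
    """
    tokens = text.split()
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def delta_h(original: str, rewrite: str) -> float:
    """Entropy shift induced by the rewrite: H(rewrite) - H(original)."""
    return token_entropy(rewrite) - token_entropy(original)


def delta_s(emb_original: np.ndarray, emb_rewrite: np.ndarray) -> float:
    """Cosine similarity between original and rewritten embeddings."""
    return float(
        np.dot(emb_original, emb_rewrite)
        / (np.linalg.norm(emb_original) * np.linalg.norm(emb_rewrite))
    )
```

Given per-configuration ΔH values and the corresponding NDCG@10 gains, the predictive claim can be checked with a rank correlation, e.g. `scipy.stats.spearmanr(delta_h_values, ndcg_gains)`, which returns the (ρ, p) pair of the kind the abstract reports.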