Do not copy and paste! Rewriting strategies for code retrieval

May 8, 2026
Authors: Andrea Gurioli, Federico Pennino, Maurizio Gabbrielli
cs.AI

Abstract

Embedding-based code retrieval often suffers when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and corpora into a normalized style, but leaves two questions open: how much representational shift helps, and when is the per-query LLM call justified? We study a hierarchy of three rewriting strategies (stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription) under joint query-corpus (QC, online) and corpus-only (C, offline) augmentation, across six CoIR benchmarks, five encoders, and three rewriters spanning independent model families (Qwen, DeepSeek, Mistral). We are the first to evaluate NL-enriched PseudoCode and snippet-level Natural Language as direct retrieval representations rather than as transient intermediates. Full NL rewriting with QC yields the largest gains (+0.51 absolute NDCG@10 on CT-Contest for MoSE-18), while corpus-only rewriting degrades retrieval in 56 of 90 configurations (about 62%). We introduce two diagnostics, Delta H (token-entropy shift) and Delta s (embedding-cosine shift), and show that Delta H predicts retrieval gain under QC across all three rewriter families: pooled Spearman rho = +0.436 (p < 0.001) on DeepSeek+Codestral, rho = +0.593 on Codestral alone, and rho = +0.356 on Qwen. This establishes Delta H as a cheap, rewriter-agnostic proxy for deciding, before running retrieval, whether rewriting pays off. Our analysis reframes LLM rewriting as a cost-benefit decision: it is most effective as a remediation layer for lightweight encoders on code-dominant queries, with diminishing returns for strong encoders or NL-heavy queries.
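
The abstract does not spell out how Delta H and Delta s are computed, so the sketch below is an illustrative reconstruction only: Delta H is taken as the change in token-level Shannon entropy between the original and rewritten text, and Delta s as the cosine distance between their embeddings. The tokenizer and the `embed` callable are hypothetical placeholders, not the paper's implementation.

```python
# Illustrative sketch of the Delta H / Delta s diagnostics (assumed definitions).
import math
import re
from collections import Counter
from typing import Callable

import numpy as np


def token_entropy(text: str) -> float:
    """Shannon entropy (bits) of the empirical token distribution of `text`."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def delta_h(original: str, rewritten: str) -> float:
    """Entropy shift introduced by the rewrite (positive = more diverse tokens)."""
    return token_entropy(rewritten) - token_entropy(original)


def delta_s(original: str, rewritten: str,
            embed: Callable[[str], np.ndarray]) -> float:
    """Embedding shift: cosine distance between original and rewritten text.

    `embed` is any encoder returning a 1-D vector (hypothetical here)."""
    a, b = embed(original), embed(rewritten)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos
```

Under the paper's framing, per-benchmark Delta H values could then be correlated (e.g. via `scipy.stats.spearmanr`) with observed NDCG@10 gains to decide, before running retrieval, whether query-corpus rewriting is likely to help.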