Do not copy and paste! Rewriting strategies for code retrieval

May 8, 2026
Authors: Andrea Gurioli, Federico Pennino, Maurizio Gabbrielli
cs.AI

Abstract

Embedding-based code retrieval often suffers when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and corpora into a normalized style, but leaves two questions open: how much representational shift helps, and when is the per-query LLM call justified? We study a hierarchy of three rewriting strategies (stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription) under joint query-corpus (QC, online) and corpus-only (C, offline) augmentation, across six CoIR benchmarks, five encoders, and three rewriters spanning independent model families (Qwen, DeepSeek, Mistral). We are the first to evaluate NL-enriched PseudoCode and snippet-level Natural Language as direct retrieval representations rather than as transient intermediates. Full NL rewriting with QC yields the largest gains (+0.51 absolute NDCG@10 on CT-Contest for MoSE-18), while corpus-only rewriting degrades retrieval in 56 of 90 configurations (about 62%). We introduce two diagnostics, Delta H (token-entropy shift) and Delta s (embedding-cosine shift), and show that Delta H predicts retrieval gain under QC across all three rewriter families: pooled Spearman rho = +0.436 (p < 0.001) on DeepSeek+Codestral, rho = +0.593 on Codestral alone, and rho = +0.356 on Qwen. This establishes Delta H as a cheap, rewriter-agnostic proxy for deciding, before running retrieval, whether rewriting pays off. Our analysis reframes LLM rewriting as a cost-benefit decision: it is most effective as a remediation layer for lightweight encoders on code-dominant queries, with diminishing returns for strong encoders or NL-heavy queries.
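
The abstract does not spell out how Delta H and Delta s are computed, so the sketch below is an illustrative reconstruction only: Delta H is taken as the change in token-level Shannon entropy between the original and rewritten text, and Delta s as the cosine distance between their embeddings. The tokenizer and the `embed` callable are hypothetical placeholders, not the paper's implementation.

```python
# Illustrative sketch of the Delta H / Delta s diagnostics (assumed definitions).
import math
import re
from collections import Counter
from typing import Callable

import numpy as np


def token_entropy(text: str) -> float:
    """Shannon entropy (bits) of the empirical token distribution of `text`."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def delta_h(original: str, rewritten: str) -> float:
    """Entropy shift introduced by the rewrite (positive = more diverse tokens)."""
    return token_entropy(rewritten) - token_entropy(original)


def delta_s(original: str, rewritten: str,
            embed: Callable[[str], np.ndarray]) -> float:
    """Embedding shift: cosine distance between original and rewritten text.

    `embed` is any encoder returning a 1-D vector (hypothetical here)."""
    a, b = embed(original), embed(rewritten)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos
```

Under the paper's framing, per-benchmark Delta H values could then be correlated (e.g. via `scipy.stats.spearmanr`) with observed NDCG@10 gains to decide, before running retrieval, whether query-corpus rewriting is likely to help.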