Do not copy and paste! Rewriting strategies for code retrieval

Abstract

Embedding-based code retrieval often degrades when encoders overfit to surface syntax. Prior work mitigates this by using LLMs to rephrase queries and corpora into a normalized style, but leaves two questions open: how much representational shift helps, and when is the per-query LLM call justified? We study a hierarchy of three rewriting strategies (stylistic rephrasing, NL-enriched PseudoCode, and full Natural-Language transcription) under joint query-corpus (QC, online) and corpus-only (C, offline) augmentation, across six CoIR benchmarks, five encoders, and three rewriters spanning independent model families (Qwen, DeepSeek, Mistral). We are the first to evaluate NL-enriched PseudoCode and snippet-level Natural Language as direct retrieval representations rather than as transient intermediates. Full NL rewriting under QC yields the largest gains (+0.51 absolute NDCG@10 on CT-Contest for MoSE-18), while corpus-only rewriting degrades retrieval in 56 of 90 configurations (about 62%). We introduce two diagnostics, Delta H (token entropy) and Delta s (embedding cosine similarity), and show that Delta H predicts retrieval gain under QC across all three rewriter families: pooled Spearman rho = +0.436 (p < 0.001) on DeepSeek+Codestral, rho = +0.593 on Codestral alone, and rho = +0.356 on Qwen. This establishes Delta H as a cheap, rewriter-agnostic proxy for deciding whether rewriting will pay off before running retrieval. Our analysis reframes LLM rewriting as a cost-benefit decision: it is most effective as a remediation layer for lightweight encoders on code-dominant queries, with diminishing returns for strong encoders or NL-heavy queries.
