Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Abstract

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines that include reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark that goes beyond retrieval to cover the full code search pipeline, together with a fine-tuned code reranker. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and is delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (roughly 2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline that is net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
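
To make the metric and the graded-versus-binary distinction concrete, the following is a minimal sketch of nDCG@10 over graded relevance labels. It assumes the common exponential-gain formulation (gain 2^rel - 1, log2 rank discount); the exact variant used for CoREB is not stated here. The function names `dcg_at_k` and `ndcg_at_k` and the toy labels are hypothetical and are not taken from the released data.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded labels.

    Assumption: exponential gain (2**rel - 1) with a log2 rank discount.
    """
    return sum(
        (2 ** g - 1) / math.log2(rank + 2)  # rank is 0-based, so +2
        for rank, g in enumerate(gains[:k])
    )


def ndcg_at_k(ranked_gains, all_gains, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal DCG.

    ranked_gains: labels of the returned items, in ranked order.
    all_gains: labels of every judged item for the query.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Toy query (hypothetical labels): 2 = exact match, 1 = partial, 0 = irrelevant.
graded_labels = [1, 2, 0]   # system ranks the partial match above the exact one
binary_labels = [1, 1, 0]   # same ranking after collapsing labels to relevant/not

graded_score = ndcg_at_k(graded_labels, graded_labels)
binary_score = ndcg_at_k(binary_labels, binary_labels)
```

In this toy case the binary labels score the ranking as perfect, while the graded labels penalise placing the partial match above the exact one, which is the kind of difference that binary relevance cannot express.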