HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings Under Unified Conditions

Abstract

With the rapid spread of retrieval-augmented generation and semantic search, choosing an appropriate embedding and retrieval configuration is increasingly difficult. Large retrieval benchmarks are comprehensive but too heavy to rerun during development, and there is little infrastructure for comparing production settings (dimensionality reduction, quantization, reranking) across many models under identical conditions. We present HAKARI-Bench, a lightweight benchmark that reconstructs existing retrieval suites into small datasets (Nano-sets). It provides 551 tasks spanning 35 benchmarks and 43 languages in a unified format, enabling same-condition, model-agnostic comparison of five retrieval families (BM25, dense, sparse, late interaction, and rerankers) and their efficiency variants. Across 55 models, the overall HAKARI-Bench ranking reproduces the official MTEB retrieval v2, MMTEB v2 retrieval, and English BEIR (full) rankings with a Spearman correlation above 0.97. HAKARI-Bench does not replace full evaluation; it enables rapid model selection, regression detection, and identification of the quality-efficiency Pareto frontier. Code, data, and leaderboard are released under the MIT license.
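To make the ranking-agreement claim concrete, the following is a minimal sketch of how a Spearman rank correlation between a lightweight benchmark and a full benchmark could be computed. It is not the HAKARI-Bench implementation: the function names `average_ranks` and `spearman` and the score values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the window over a run of equal values.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores (for example, mean nDCG@10); not real results.
nano_scores = [0.61, 0.55, 0.70, 0.48, 0.66]
full_scores = [0.60, 0.54, 0.69, 0.50, 0.58]

if __name__ == "__main__":
    print(f"Spearman correlation: {spearman(nano_scores, full_scores):.3f}")
```

A value near 1 means both benchmarks order the models almost identically, which is the property the abstract reports for the 55 evaluated models.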