HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

Abstract

With the rapid spread of retrieval-augmented generation and semantic search, choosing the right embedding and retrieval configuration is increasingly hard. Large retrieval benchmarks are comprehensive but too expensive to rerun during development, and there is little infrastructure for comparing production settings (such as dimensionality reduction, quantization, and reranking) across many models under identical conditions. We present HAKARI-Bench, a lightweight benchmark framework that reconstructs existing retrieval suites into small datasets (Nano-sets): 35 benchmarks and 551 tasks across 43 languages in a unified format, enabling same-condition, model-agnostic comparison of five retrieval families (BM25, dense, sparse, late interaction, and rerankers) and their efficiency variants. Across 55 models, its overall ranking agrees with the official MTEB retrieval v2, MMTEB v2 retrieval, and English BEIR (full) rankings, with Spearman correlations above 0.97 in each case. HAKARI-Bench does not replace full evaluation; it enables rapid model selection, regression detection, and reading of the quality-efficiency Pareto frontier. Code, data, and leaderboard are released under the MIT license.
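To make the two quantitative ideas in the abstract concrete, the following is a minimal sketch of how rank agreement between a small benchmark and a full benchmark could be measured with Spearman correlation, and how a quality-efficiency Pareto frontier could be read from a set of configurations. It is not HAKARI-Bench code: the function names, configuration names, scores, and costs below are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


def pareto_frontier(configs):
    """Names of configs not dominated on (higher quality, lower cost)."""
    frontier = []
    for name, quality, cost in configs:
        dominated = any(
            q2 >= quality and c2 <= cost and (q2 > quality or c2 < cost)
            for _, q2, c2 in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical per-model scores on a small set and on the full benchmark.
nano_scores = [0.52, 0.61, 0.47, 0.66, 0.58]
full_scores = [0.50, 0.65, 0.45, 0.64, 0.60]
rho = spearman(nano_scores, full_scores)

# Hypothetical (name, quality, relative cost) triples for efficiency variants.
configs = [
    ("dense-fp32", 0.64, 4.0),
    ("dense-int8", 0.63, 1.0),
    ("dense-256d", 0.60, 1.0),
    ("bm25", 0.45, 0.1),
    ("dense+rerank", 0.68, 20.0),
]
frontier = pareto_frontier(configs)

if __name__ == "__main__":
    print(f"Spearman rho = {rho:.2f}")
    print("Pareto frontier:", frontier)
```

In this toy data the two score lists disagree only on the order of the top two models, so the correlation stays high, and "dense-256d" drops off the frontier because "dense-int8" has higher quality at the same cost.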