HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

Abstract

With the rapid spread of retrieval-augmented generation and semantic search, choosing the right embedding and retrieval configuration has become increasingly difficult. Large retrieval benchmarks are comprehensive but too heavy to rerun during development, and there is little infrastructure for comparing production settings (dimensionality reduction, quantization, reranking) across many models under identical conditions. We present HAKARI-Bench, a lightweight benchmark that reconstructs existing retrieval suites into small datasets (Nano-sets). It provides 35 benchmarks and 551 tasks across 43 languages in a unified format, enabling same-condition, model-agnostic comparison of five retrieval families (BM25, dense, sparse, late interaction, and rerankers) and their efficiency variants. Across 55 models, its overall ranking reproduces the official MTEB Retrieval v2, MMTEB v2 Retrieval, and English BEIR (full) rankings with a Spearman correlation above 0.97. HAKARI-Bench does not replace full evaluation; it enables rapid model selection, regression detection, and reading of the quality-efficiency Pareto frontier. Code, data, and the leaderboard are released under the MIT license.
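
To make the ranking-agreement claim concrete, the following is a minimal sketch of how a Spearman correlation between two sets of per-model scores can be computed. It is not taken from the HAKARI-Bench code base: the function names, the model names, and every score below are hypothetical values chosen only for illustration, and the benchmark's own tooling may compute the statistic differently.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores (illustration only, not results from the paper):
# one aggregate retrieval score per model on the small sets and on the full suite.
nano_scores = {"model_a": 0.52, "model_b": 0.61, "model_c": 0.47,
               "model_d": 0.58, "model_e": 0.66}
full_scores = {"model_a": 0.50, "model_b": 0.57, "model_c": 0.45,
               "model_d": 0.59, "model_e": 0.64}

models = sorted(nano_scores)
rho = spearman([nano_scores[m] for m in models],
               [full_scores[m] for m in models])
print(f"Spearman rho over {len(models)} models: {rho:.2f}")
```

In this toy data only two adjacent models swap places, which already lowers rho to 0.90 with five models. A value above 0.97 over 55 models therefore indicates that the small-set ranking tracks the full-benchmark ranking closely.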