JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources

Abstract

Neural information retrieval has advanced rapidly in high-resource languages, but progress in lower-resource languages such as Japanese has been hindered by data scarcity, among other challenges. Consequently, multilingual models have dominated Japanese retrieval, despite their computational inefficiency and their inability to capture linguistic nuances. Recent multi-vector monolingual models such as JaColBERT have narrowed this gap, but they still lag behind multilingual methods in large-scale evaluations. This work addresses the suboptimal training methods of multi-vector retrievers in lower-resource settings, focusing on Japanese. We systematically evaluate and improve key aspects of the inference and training settings of JaColBERT and, more broadly, of multi-vector models. We further improve performance through a novel checkpoint merging step, which we show to be an effective way of combining the benefits of fine-tuning with the generalization capabilities of the original checkpoint. Building on our analysis, we introduce a novel training recipe, resulting in the JaColBERTv2.5 model. With only 110 million parameters and under 15 hours of training on four A100 GPUs, JaColBERTv2.5 significantly outperforms all existing methods across all common benchmarks, reaching an average score of 0.754 against the previous best of 0.720. To support future research, we make our final models, intermediate checkpoints, and all data used publicly available.
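The abstract names a checkpoint merging step but does not say how the merge is computed. As a rough illustration only, the sketch below assumes the simplest common variant: a weighted average of parameters across checkpoints that share the same layout. The function name merge_checkpoints, the flat dictionary representation of weights, and the 50/50 coefficients are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence

# Hypothetical flat "state dict": parameter name -> list of values.
Weights = Dict[str, List[float]]


def merge_checkpoints(checkpoints: Sequence[Weights], coeffs: Sequence[float]) -> Weights:
    """Return the weighted average of checkpoints that share one layout.

    Assumption: merging is plain parameter averaging. The abstract does
    not confirm that this is the procedure used for JaColBERTv2.5.
    """
    if not checkpoints or len(checkpoints) != len(coeffs):
        raise ValueError("need one coefficient per checkpoint")
    total = sum(coeffs)
    if total == 0:
        raise ValueError("coefficients must not sum to zero")

    merged: Weights = {}
    for name, reference in checkpoints[0].items():
        acc = [0.0] * len(reference)
        for ckpt, coeff in zip(checkpoints, coeffs):
            values = ckpt[name]
            if len(values) != len(reference):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            # Normalise coefficients so they always sum to one.
            for i, v in enumerate(values):
                acc[i] += (coeff / total) * v
        merged[name] = acc
    return merged


# Toy example: an "original" checkpoint and a "fine-tuned" one, merged 50/50.
original = {"linear.weight": [1.0, 2.0], "linear.bias": [0.0]}
fine_tuned = {"linear.weight": [3.0, 4.0], "linear.bias": [1.0]}
merged = merge_checkpoints([original, fine_tuned], [0.5, 0.5])
```

In this toy setting each merged parameter sits halfway between the two checkpoints, which is the intuition behind combining fine-tuning gains with the generalization of the original weights. With a real model the same averaging would be applied tensor by tensor.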
