JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources
July 30, 2024
Author: Benjamin Clavié
cs.AI
Abstract
Neural Information Retrieval has advanced rapidly in high-resource languages,
but progress in lower-resource ones such as Japanese has been hindered by data
scarcity, among other challenges. Consequently, multilingual models have
dominated Japanese retrieval, despite their computational inefficiencies and
inability to capture linguistic nuances. While recent multi-vector monolingual
models like JaColBERT have narrowed this gap, they still lag behind
multilingual methods in large-scale evaluations. This work addresses the
suboptimal training methods of multi-vector retrievers in lower-resource
settings, focusing on Japanese. We systematically evaluate and improve key
aspects of the inference and training settings of JaColBERT, and more broadly,
multi-vector models. We further enhance performance through a novel checkpoint
merging step, showcasing it to be an effective way of combining the benefits of
fine-tuning with the generalization capabilities of the original checkpoint.
Building on our analysis, we introduce a novel training recipe, resulting in
the JaColBERTv2.5 model. JaColBERTv2.5, with only 110 million parameters and
trained in under 15 hours on 4 A100 GPUs, significantly outperforms all
existing methods across all common benchmarks, reaching an average score of
0.754, significantly above the previous best of 0.720. To support future
research, we make our final models, intermediate checkpoints and all data used
publicly available.
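The checkpoint-merging step mentioned in the abstract can be understood as parameter-wise interpolation between a fine-tuned checkpoint and its original starting checkpoint. The sketch below is a minimal illustration of that general idea, using scalar values as stand-ins for weight tensors; the function name, the `alpha` parameter, and the uniform averaging scheme are assumptions for illustration, not the exact procedure used for JaColBERTv2.5.

```python
def merge_checkpoints(finetuned, original, alpha=0.5):
    """Linearly interpolate two checkpoints, parameter by parameter.

    alpha=1.0 keeps only the fine-tuned weights (full specialization);
    alpha=0.0 keeps only the original weights (full generalization);
    values in between trade off the two, which is the intuition behind
    combining fine-tuning gains with the original checkpoint's robustness.
    """
    return {
        name: alpha * finetuned[name] + (1.0 - alpha) * original[name]
        for name in finetuned
    }

# Toy example: scalars stand in for the model's weight tensors.
finetuned = {"w": 1.0, "b": 0.5}
original = {"w": 0.0, "b": 0.1}
merged = merge_checkpoints(finetuned, original, alpha=0.5)
# merged["w"] is halfway between 1.0 and 0.0; merged["b"] between 0.5 and 0.1
```

In a real setting the same loop would run over the model's state dict of tensors rather than Python floats, but the interpolation logic is unchanged.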