JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources
July 30, 2024
Author: Benjamin Clavié
cs.AI
Abstract
Neural Information Retrieval has advanced rapidly in high-resource languages,
but progress in lower-resource ones such as Japanese has been hindered by data
scarcity, among other challenges. Consequently, multilingual models have
dominated Japanese retrieval, despite their computational inefficiencies and
inability to capture linguistic nuances. While recent multi-vector monolingual
models like JaColBERT have narrowed this gap, they still lag behind
multilingual methods in large-scale evaluations. This work addresses the
suboptimal training methods of multi-vector retrievers in lower-resource
settings, focusing on Japanese. We systematically evaluate and improve key
aspects of the inference and training settings of JaColBERT, and more broadly,
multi-vector models. We further enhance performance through a novel checkpoint
merging step, showcasing it to be an effective way of combining the benefits of
fine-tuning with the generalization capabilities of the original checkpoint.
Building on our analysis, we introduce a novel training recipe, resulting in
the JaColBERTv2.5 model. JaColBERTv2.5, with only 110 million parameters and
trained in under 15 hours on 4 A100 GPUs, significantly outperforms all
existing methods across all common benchmarks, reaching an average score of
0.754, well above the previous best of 0.720. To support future
research, we make our final models, intermediate checkpoints and all data used
publicly available.
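The abstract highlights a checkpoint merging step that combines the benefits of fine-tuning with the generalization of the original checkpoint. A common way to merge checkpoints is simple linear interpolation of parameter values; the sketch below illustrates that generic idea with plain Python floats standing in for tensors. The function name, the `alpha` parameter, and the toy parameter dicts are illustrative assumptions, not the paper's exact procedure.

```python
def merge_checkpoints(base, finetuned, alpha=0.5):
    """Linearly interpolate the parameters of two checkpoints.

    alpha=0.0 returns the base (original) weights, alpha=1.0 the
    fine-tuned ones. This is a generic weight-averaging sketch,
    not necessarily the exact merging method used for JaColBERTv2.5.
    """
    return {name: (1 - alpha) * base[name] + alpha * finetuned[name]
            for name in base}

# Toy example with scalar "parameters" standing in for weight tensors:
base = {"layer.weight": 1.0, "layer.bias": 0.0}
tuned = {"layer.weight": 3.0, "layer.bias": 2.0}
merged = merge_checkpoints(base, tuned, alpha=0.5)
print(merged)  # {'layer.weight': 2.0, 'layer.bias': 1.0}
```

In practice the same interpolation would be applied elementwise to each tensor in the two models' state dicts; the blend weight trades off task-specific gains against the original checkpoint's generalization.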