

JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources

July 30, 2024
Author: Benjamin Clavié
cs.AI

Abstract

Neural Information Retrieval has advanced rapidly in high-resource languages, but progress in lower-resource ones such as Japanese has been hindered by data scarcity, among other challenges. Consequently, multilingual models have dominated Japanese retrieval, despite their computational inefficiency and inability to capture linguistic nuances. While recent multi-vector monolingual models like JaColBERT have narrowed this gap, they still lag behind multilingual methods in large-scale evaluations. This work addresses the suboptimal training methods of multi-vector retrievers in lower-resource settings, focusing on Japanese. We systematically evaluate and improve key aspects of the inference and training settings of JaColBERT and, more broadly, of multi-vector models. We further enhance performance through a novel checkpoint merging step, showing it to be an effective way of combining the benefits of fine-tuning with the generalization capabilities of the original checkpoint. Building on our analysis, we introduce a novel training recipe, resulting in the JaColBERTv2.5 model. JaColBERTv2.5, with only 110 million parameters and trained in under 15 hours on 4 A100 GPUs, significantly outperforms all existing methods across all common benchmarks, reaching an average score of 0.754, well above the previous best of 0.720. To support future research, we make our final models, intermediate checkpoints, and all data used publicly available.
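
For context, "multi-vector" retrievers such as JaColBERT follow the ColBERT family and score a query-document pair with the MaxSim late-interaction operator: each query token embedding takes its maximum similarity over all document token embeddings, and these maxima are summed. A minimal PyTorch sketch of this standard scoring (the function name and tensor shapes are illustrative, not taken from the paper):

```python
import torch

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction.

    query_embs: (num_query_tokens, dim), assumed L2-normalized
    doc_embs:   (num_doc_tokens, dim),   assumed L2-normalized
    Returns a scalar relevance score.
    """
    # (num_query_tokens, num_doc_tokens) cosine-similarity matrix
    sims = query_embs @ doc_embs.T
    # Max over document tokens, then sum over query tokens.
    return sims.max(dim=1).values.sum()

# Toy usage with random normalized embeddings:
q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(q, d))
```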
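The checkpoint merging step described in the abstract is, in spirit, an averaging of model weights between a fine-tuned checkpoint and an earlier one. A minimal sketch, assuming simple linear interpolation of two PyTorch state dicts (`merge_checkpoints` and `alpha` are illustrative names; the paper's exact procedure and weighting may differ):

```python
import torch

def merge_checkpoints(path_a: str, path_b: str, alpha: float = 0.5) -> dict:
    """Linearly interpolate two checkpoints' weights.

    Assumes both checkpoints share the same architecture and keys,
    and that all merged entries are floating-point tensors.
    alpha = 0.5 gives a plain average of the two models.
    """
    state_a = torch.load(path_a, map_location="cpu")
    state_b = torch.load(path_b, map_location="cpu")
    return {k: alpha * state_a[k] + (1.0 - alpha) * state_b[k] for k in state_a}
```

The intuition matching the abstract: the fine-tuned weights contribute task performance, the original weights contribute generalization, and interpolating between them can retain some of both.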
