TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
November 20, 2025
Authors: Özay Ezerceli, Mahmoud El Hussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu, Yusuf Çelebi, Yağız Asker
cs.AI
Abstract
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models, which retain token-level representations for fine-grained matching, have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600× smaller than the 600M-parameter turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3-5× smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. For production readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33× faster than PLAID and offers a +1.7% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets (≤50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
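Since the abstract describes converting adapted encoders into ColBERT-style retrievers with PyLate and then indexing them for retrieval, the sketch below shows what that retrieval step looks like with PyLate's public API. This is a minimal illustration, not the paper's released code: the checkpoint path, index name, documents, and query are placeholders.

```python
# Minimal sketch, assuming PyLate's standard ColBERT/Voyager API.
# The model path, index name, documents, and query are illustrative placeholders,
# not assets released with TurkColBERT.
from pylate import indexes, models, retrieve

# Load an adapted Turkish encoder as a ColBERT-style (late-interaction) model.
model = models.ColBERT(model_name_or_path="path/to/colmmbert-base-tr")  # placeholder

# Build a token-level index (Voyager is PyLate's HNSW-backed default backend).
index = indexes.Voyager(
    index_folder="turkcolbert-index",
    index_name="scifact-tr",
    override=True,
)
retriever = retrieve.ColBERT(index=index)

documents = ["Türkçe örnek belge metni ...", "Bir başka belge ..."]
document_ids = ["d1", "d2"]

# Encode documents into per-token embeddings and add them to the index.
document_embeddings = model.encode(documents, is_query=False, show_progress_bar=False)
index.add_documents(documents_ids=document_ids, documents_embeddings=document_embeddings)

# Encode a query and score it against the index with late interaction (MaxSim).
query_embeddings = model.encode(["örnek Türkçe sorgu"], is_query=True, show_progress_bar=False)
results = retriever.retrieve(queries_embeddings=query_embeddings, k=10)
print(results)  # ranked document ids with MaxSim scores
```

Swapping the index backend is where the abstract's PLAID vs. MUVERA+Rerank comparison would plug in; the MUVERA path additionally involves a fixed-dimensional encoding plus a reranking pass, which this sketch does not cover.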