TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
November 20, 2025
Authors: Özay Ezerceli, Mahmoud El Hussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu, Yusuf Çelebi, Yağız Asker
cs.AI
Abstract
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models, which retain token-level representations for fine-grained matching, have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline first fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600× smaller than the 600M-parameter turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3–5× smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. To assess production readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33× faster than PLAID and offers a +1.7% relative mAP gain. This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets (≤50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
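To make the contrast between dense bi-encoders and late-interaction retrievers concrete, the sketch below (not taken from the paper's code; shapes and toy data are illustrative) scores a query-document pair two ways: once with a single pooled vector per text, and once with ColBERT-style MaxSim over token-level embeddings, which is the mechanism the benchmarked late-interaction models rely on.

```python
# Minimal sketch: dense bi-encoder scoring vs. ColBERT-style late interaction (MaxSim).
# Token counts, dimensions, and the random embeddings are hypothetical stand-ins
# for real encoder outputs such as those produced by the TurkColBERT checkpoints.
import torch


def dense_score(q_vec: torch.Tensor, d_vec: torch.Tensor) -> torch.Tensor:
    """Dense bi-encoder: one pooled vector per text, scored by cosine similarity."""
    q = torch.nn.functional.normalize(q_vec, dim=-1)
    d = torch.nn.functional.normalize(d_vec, dim=-1)
    return (q * d).sum(-1)


def maxsim_score(q_tokens: torch.Tensor, d_tokens: torch.Tensor) -> torch.Tensor:
    """Late interaction: keep one embedding per token.

    q_tokens: [num_query_tokens, dim], d_tokens: [num_doc_tokens, dim], L2-normalized.
    Each query token takes its maximum similarity over all document tokens,
    and the per-token maxima are summed into the document score.
    """
    sim = q_tokens @ d_tokens.T          # [num_query_tokens, num_doc_tokens]
    return sim.max(dim=1).values.sum()   # MaxSim per query token, summed


if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 128
    q_tokens = torch.nn.functional.normalize(torch.randn(8, dim), dim=-1)
    d_tokens = torch.nn.functional.normalize(torch.randn(40, dim), dim=-1)
    print("late-interaction score:", maxsim_score(q_tokens, d_tokens).item())
    print("dense score:", dense_score(q_tokens.mean(0), d_tokens.mean(0)).item())
```

Because the document side of MaxSim only needs precomputed token embeddings, the scoring step stays cheap at query time, which is what makes engines like PLAID or MUVERA-style single-vector approximations (as compared in the paper) practical over late-interaction indexes.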