Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0 Tech Report
October 16, 2025
Authors: Rikiya Takehi, Benjamin Clavié, Sean Lee, Aamir Shakir
cs.AI
Abstract
In this work, we introduce the mxbai-edge-colbert-v0 models, at two different
parameter counts: 17M and 32M. As part of our research, we conduct numerous
experiments to improve retrieval and late-interaction models, which we intend
to distill into smaller models as proofs of concept. Our ultimate aim is to
support retrieval at all scales, from large-scale retrieval that lives in the
cloud to models that can run locally on any device. mxbai-edge-colbert-v0 is a
model that we hope will serve as a solid foundation for all future
experiments, representing the first version in a long series of small proofs
of concept. As part of the development of mxbai-edge-colbert-v0, we conducted
multiple ablation studies, whose results we report here. In terms of
downstream performance, mxbai-edge-colbert-v0 is a particularly capable small
model, outperforming ColBERTv2 on common short-text benchmarks (BEIR) and
representing a large step forward in long-context tasks, with unprecedented
efficiency.
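As background on the late-interaction paradigm the abstract refers to, the following is a minimal sketch of ColBERT-style MaxSim scoring: each query token embedding is matched against its most similar document token embedding, and the maxima are summed. This is an illustrative toy implementation, not the authors' code; the function name, dimensions, and random inputs are assumptions.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style late-interaction score.

    For each query token embedding, take the maximum cosine
    similarity over all document token embeddings, then sum
    over query tokens.
    """
    # Normalize token embeddings so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                     # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, 4-dim embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
d = rng.normal(size=(3, 4))
score = maxsim_score(q, d)
```

Because every per-token maximum is a cosine similarity, the score is bounded above by the number of query tokens, and scoring a document against itself attains that bound.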