
NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

March 13, 2026
Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao
cs.AI

Abstract

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. However, they require the same multi-billion-parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query–document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through a systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting the training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on ViDoRe v2 and v3 with 32× fewer parameters and 50× lower CPU query latency, at a total training cost under 13 GPU-hours.
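The pointwise cosine-alignment objective the abstract describes can be sketched as follows: the student is trained to match each pre-cached teacher query embedding directly, with no documents or in-batch negatives involved. This is a minimal illustrative sketch, not the paper's implementation; the function names and pure-Python vectors are assumptions for clarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pointwise_cosine_loss(student_embs, teacher_embs):
    """Mean (1 - cosine) over a batch of query embeddings.

    teacher_embs are computed once by the frozen VLM teacher and cached,
    so training touches only query text -- no document processing.
    """
    losses = [1.0 - cosine(s, t) for s, t in zip(student_embs, teacher_embs)]
    return sum(losses) / len(losses)
```

A perfectly aligned student (identical directions) yields a loss of 0, and orthogonal embeddings yield 1, which is what makes the objective a simple pointwise regression target rather than a ranking or contrastive loss over candidate documents.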
March 30, 2026