

NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

March 13, 2026
Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao
cs.AI

Abstract

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. However, they require the same multi-billion-parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through a systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck and resolve it cheaply by augmenting the training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on ViDoRe v2 and v3 with 32× fewer parameters and 50× lower CPU query latency, at a total training cost under 13 GPU-hours.
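The pointwise cosine-alignment objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's released code: it assumes PyTorch, and the tensor shapes and the `cosine_alignment_loss` name are hypothetical. The key property it demonstrates is that training touches only query embeddings, with the teacher's outputs loaded from an offline cache, so no document (or VLM forward pass) is needed at training time.

```python
# Hedged sketch of pointwise cosine-alignment distillation (assumption:
# PyTorch; function and shapes are illustrative, not the paper's code).
import torch
import torch.nn.functional as F


def cosine_alignment_loss(student_emb: torch.Tensor,
                          teacher_emb: torch.Tensor) -> torch.Tensor:
    """Mean (1 - cosine similarity) between student query embeddings and
    pre-cached teacher query embeddings. No documents are processed."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()


# Toy usage: a batch of 4 queries with 768-dim embeddings. In practice the
# teacher embeddings would be loaded from an offline cache, and student_emb
# would come from the small text-only encoder (e.g. a DistilBERT head).
student_emb = torch.randn(4, 768, requires_grad=True)
teacher_emb = torch.randn(4, 768)  # stands in for cached teacher outputs
loss = cosine_alignment_loss(student_emb, teacher_emb)
loss.backward()
```

Because both vectors are L2-normalized, the loss is bounded in [0, 2] and reaches 0 exactly when the student reproduces the teacher's embedding direction, which is what lets the student plug into the teacher's frozen document index at inference.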