NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Abstract

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. However, they require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through a systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting the training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32× fewer parameters and 50× lower CPU query latency, at a total training cost under 13 GPU-hours.
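As a rough illustration of the pointwise cosine alignment objective and the asymmetric retrieval setup described above, the following is a minimal sketch in plain Python. It assumes single-vector embeddings and a loss of the form 1 - cos(student, teacher) averaged over queries; the exact formulation used in the paper may differ. The function names (`cosine`, `pointwise_cosine_loss`, `rank_documents`) are hypothetical and chosen only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pointwise_cosine_loss(student_embs, teacher_embs):
    """Assumed objective: mean of (1 - cos(student_i, teacher_i)) over queries.

    Only query embeddings appear here; no document is processed, which
    mirrors the abstract's point that pre-cached teacher query embeddings
    are sufficient for training.
    """
    losses = [1.0 - cosine(s, t) for s, t in zip(student_embs, teacher_embs)]
    return sum(losses) / len(losses)


def rank_documents(query_emb, doc_embs):
    """Rank documents (indexed offline by the teacher) against a query
    embedding (produced by the student), highest cosine similarity first."""
    scores = [cosine(query_emb, d) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
```

In this sketch the student is trained so that its query embedding lands in the teacher's embedding space, which is why the student's output can be compared directly against document embeddings that the teacher computed offline.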

