When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs
September 20, 2025
Authors: Abhirama Subramanyam Penamakuri, Navlika Singh, Piyush Arora, Anand Mishra
cs.AI
Abstract
Large Vision-Language Models (L-VLMs) have demonstrated remarkable
performance in various vision and language tasks, including visual question
answering (VQA). However, their high computational cost makes them impractical
for resource-constrained settings and inference-heavy applications. In
contrast, Small Vision-Language Models (S-VLMs) offer efficiency but suffer
from a significant performance gap compared to their larger counterparts. In
this work, we introduce the Model Parity Aligner (MPA), a novel framework
designed to systematically improve S-VLMs by leveraging unlabeled images and
effective knowledge transfer from L-VLMs. Instead of traditional knowledge
distillation methods that rely on labeled training data, MPA employs a
strategic parity-based approach that precisely identifies the knowledge
disparities between S-VLMs and L-VLMs, and optimizes training by targeting only
these disparities. We conduct extensive experiments on four diverse VQA
benchmarks, namely TextVQA, ST-VQA, ChartQA, and OKVQA, each of which requires
specialized reasoning capabilities such as text recognition, chart
interpretation, and commonsense and factual understanding. Our results
demonstrate that MPA consistently enhances the performance of S-VLMs on all
benchmarks, reducing the performance gap while maintaining computational
efficiency. We make our code publicly available.
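As a rough illustration of the parity-based idea described in the abstract (not the authors' implementation), the sketch below shows one way such a loop could look: the large VLM's answers on unlabeled images act as pseudo-labels, and only the (image, question) pairs where the small VLM disagrees are retained as the "knowledge disparity" set used for training. All names here (`l_vlm`, `s_vlm`, `generate_questions`, `fine_tune`) are hypothetical placeholders.

```python
# Illustrative sketch of a parity-based transfer loop; all model wrappers and
# the training call are hypothetical stand-ins, not the paper's actual code.

from dataclasses import dataclass

@dataclass
class ParityExample:
    image_path: str
    question: str
    pseudo_answer: str  # answer produced by the large VLM (pseudo-label)

def collect_parity_gaps(l_vlm, s_vlm, unlabeled_images, generate_questions):
    """Keep only pairs where the small model disagrees with the large model,
    i.e. the examples that expose a knowledge disparity between the two."""
    gaps = []
    for image_path in unlabeled_images:
        for question in generate_questions(image_path):
            teacher_answer = l_vlm.answer(image_path, question)
            student_answer = s_vlm.answer(image_path, question)
            if student_answer.strip().lower() != teacher_answer.strip().lower():
                gaps.append(ParityExample(image_path, question, teacher_answer))
    return gaps

# Usage (hypothetical):
# gaps = collect_parity_gaps(l_vlm, s_vlm, unlabeled_images, generate_questions)
# fine_tune(s_vlm, gaps)  # optimize the small VLM only on the disparity set
```

Training only on the disparity set, rather than on all teacher outputs as in conventional distillation, is what the abstract refers to as targeting the knowledge gap; the details of how disparities are identified and filtered are specific to MPA and not reproduced here.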