

When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs

September 20, 2025
Authors: Abhirama Subramanyam Penamakuri, Navlika Singh, Piyush Arora, Anand Mishra
cs.AI

Abstract

Large Vision-Language Models (L-VLMs) have demonstrated remarkable performance in various vision and language tasks, including visual question answering (VQA). However, their high computational cost makes them impractical for resource-constrained settings and inference-heavy applications. In contrast, Small Vision-Language Models (S-VLMs) offer efficiency but suffer from a significant performance gap compared to their larger counterparts. In this work, we introduce the Model Parity Aligner (MPA), a novel framework designed to systematically improve S-VLMs by leveraging unlabeled images and effective knowledge transfer from L-VLMs. Instead of traditional knowledge distillation methods that rely on labeled training data, MPA employs a strategic parity-based approach that precisely identifies the knowledge disparities between S-VLMs and L-VLMs, and optimizes training by targeting only these disparities. We conduct extensive experiments on four diverse VQA benchmarks, namely TextVQA, ST-VQA, ChartQA, and OKVQA, each of which requires specialized reasoning capabilities such as text recognition, chart interpretation, and commonsense and factual understanding. Our results demonstrate that MPA consistently enhances the performance of S-VLMs on all benchmarks, reducing the performance gap while maintaining computational efficiency. We make our code publicly available.
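
The abstract describes the parity-based idea only at a high level: use an L-VLM on unlabeled images to expose exactly where the S-VLM falls short, then train on those cases alone. The sketch below is a minimal, hypothetical illustration of that selection loop in Python; the callables `l_vlm_qa` and `s_vlm_answer`, the exact-match disagreement test, and the downstream fine-tuning step are assumptions made for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a parity-based sample selection loop (not the paper's code).
# An L-VLM pseudo-labels unlabeled images with QA pairs; only the pairs the S-VLM
# gets wrong are kept, so training targets the identified knowledge gaps.

from typing import Callable, Iterable, List, Tuple


def build_parity_training_set(
    unlabeled_images: Iterable[object],
    l_vlm_qa: Callable[[object], List[Tuple[str, str]]],  # assumed: image -> (question, answer) pairs
    s_vlm_answer: Callable[[object, str], str],           # assumed: (image, question) -> answer
) -> List[Tuple[object, str, str]]:
    """Collect (image, question, teacher_answer) triples where the S-VLM
    disagrees with the L-VLM, i.e. the knowledge disparities to train on."""
    parity_gap_samples: List[Tuple[object, str, str]] = []
    for image in unlabeled_images:
        for question, teacher_answer in l_vlm_qa(image):
            student_answer = s_vlm_answer(image, question)
            # Simple normalized string comparison stands in for whatever
            # answer-matching criterion the method actually uses.
            if student_answer.strip().lower() != teacher_answer.strip().lower():
                parity_gap_samples.append((image, question, teacher_answer))
    return parity_gap_samples
```

Under this reading, the resulting triples would then be used to fine-tune the S-VLM, concentrating compute on the gaps rather than on the full unlabeled pool.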