When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs

Abstract

Large Vision-Language Models (L-VLMs) have demonstrated remarkable performance on a range of vision-and-language tasks, including visual question answering (VQA). However, their high computational cost makes them impractical for resource-constrained settings and inference-heavy applications. Small Vision-Language Models (S-VLMs), in contrast, are efficient but suffer from a significant performance gap relative to their larger counterparts. In this work, we introduce the Model Parity Aligner (MPA), a novel framework designed to systematically improve S-VLMs by leveraging unlabeled images and effective knowledge transfer from L-VLMs. Unlike traditional knowledge distillation methods, which rely on labeled training data, MPA employs a strategic parity-based approach that precisely identifies the knowledge disparities between S-VLMs and L-VLMs and optimizes training by targeting only these disparities. We conduct extensive experiments on four diverse VQA benchmarks, namely TextVQA, ST-VQA, ChartQA, and OKVQA, each of which requires specialized reasoning capabilities such as text recognition, chart interpretation, and commonsense and factual understanding. Our results demonstrate that MPA consistently enhances the performance of S-VLMs on all four benchmarks, reducing the performance gap while maintaining computational efficiency. We make our code publicly available.
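
The abstract does not spell out how knowledge disparities are identified, so the following is only a minimal sketch of the general idea under stated assumptions, not the MPA implementation. It assumes that a question is already available for each unlabeled image, that each model can be called as a plain function from (image identifier, question) to an answer string, and that a "disparity" is simply a sample on which the two answers differ after light normalization. The names `Model`, `normalize`, and `select_disparities` are hypothetical and introduced here for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface (an assumption, not from the paper):
# a model maps (image_id, question) to an answer string.
Model = Callable[[str, str], str]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def select_disparities(
    samples: Iterable[Tuple[str, str]],
    large_model: Model,
    small_model: Model,
) -> List[Tuple[str, str, str]]:
    """Return (image_id, question, pseudo_label) triples on which the models disagree.

    The large model's answer is used as the pseudo-label; no human label is involved.
    """
    selected = []
    for image_id, question in samples:
        teacher_answer = large_model(image_id, question)
        student_answer = small_model(image_id, question)
        if normalize(teacher_answer) != normalize(student_answer):
            selected.append((image_id, question, teacher_answer))
    return selected


# Toy stand-ins for the two models, backed by lookup tables.
samples = [("img1", "what does the sign say?"), ("img2", "what is the peak value?")]
large_answers = {"img1": "Stop", "img2": "42"}
small_answers = {"img1": "stop ", "img2": "24"}

disparities = select_disparities(
    samples,
    lambda image_id, question: large_answers[image_id],
    lambda image_id, question: small_answers[image_id],
)
print(disparities)
```

In this toy run only the second sample is kept, because the two answers to the first differ only in case and whitespace. Under the assumptions above, the small model would then be fine-tuned on the selected triples alone rather than on every pseudo-labeled sample.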
