GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

Abstract

Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging because of their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge here is the diversity of VLM architectures: they are built on different LLMs and use token types that differ in vocabulary size, token splits, and token index ordering, which ties existing distillation approaches to a specific VLM type. To remove this restriction, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performance, eventually outperforming large-scale open- and closed-source VLMs.
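The token mismatch named in the abstract can be made concrete with a toy example. The sketch below is not part of GenRecal and does not use any real tokenizer: `teacher_vocab`, `student_vocab`, and `greedy_tokenize` are hypothetical names introduced only for illustration. It shows how two vocabularies can differ in size, in how they split the same text, and in what a given token index means, which is why a position-by-position comparison of teacher and student outputs is not well defined across heterogeneous models.

```python
# Toy illustration only: both vocabularies are hypothetical and are not
# taken from any real LLM tokenizer or from the GenRecal paper.

def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; returns a list of token ids."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece starting at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

# Hypothetical "teacher" vocabulary: larger, with whole-word pieces.
teacher_vocab = {"a": 0, "c": 1, "t": 2, "s": 3, " ": 4,
                 "cat": 5, "cats": 6, "sat": 7}

# Hypothetical "student" vocabulary: smaller, different pieces,
# and a different index order for the shared single characters.
student_vocab = {" ": 0, "s": 1, "t": 2, "a": 3, "c": 4,
                 "ca": 5, "at": 6}

text = "cats sat"
teacher_ids = greedy_tokenize(text, teacher_vocab)
student_ids = greedy_tokenize(text, student_vocab)

# Inverse maps, used to show that the same index names different tokens.
teacher_tokens = {i: t for t, i in teacher_vocab.items()}
student_tokens = {i: t for t, i in student_vocab.items()}

print("vocabulary sizes:", len(teacher_vocab), len(student_vocab))
print("teacher ids:", teacher_ids)  # [6, 4, 7]
print("student ids:", student_ids)  # [5, 2, 1, 0, 1, 6]
print("index 1 means:", repr(teacher_tokens[1]), "vs", repr(student_tokens[1]))
```

In this toy setting the same eight characters become three teacher tokens but six student tokens, and index 1 refers to "c" in one vocabulary and "s" in the other. The output distributions of two such models therefore differ in both sequence length and vocabulary dimension, so they cannot be matched directly.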
