ChatPaper.ai


GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

June 18, 2025
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
cs.AI

Abstract

Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge here arises from the diversity of VLM architectures, which are built on different LLMs and employ varying token types, differing in vocabulary size, token splits, and token index ordering. To address this challenge of being limited to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performance, eventually outperforming large-scale open- and closed-source VLMs.
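To make the idea of recalibrated distillation concrete, here is a minimal sketch, not the paper's actual architecture: a hypothetical `Recalibrator` that linearly projects teacher features into the student's hidden space (so heterogeneous VLMs with mismatched dimensions can be compared), paired with a standard soft-label KL distillation loss. All class names, dimensions, and the random initialization are illustrative assumptions; in GenRecal the Recalibrator's parameters would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class Recalibrator:
    """Hypothetical sketch: map teacher features (teacher_dim) into the
    student's feature space (student_dim) so the two heterogeneous VLMs
    can be aligned token-by-token. In practice W would be trained."""
    def __init__(self, teacher_dim, student_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random init for illustration only.
        self.W = rng.standard_normal((teacher_dim, student_dim)) / np.sqrt(teacher_dim)

    def __call__(self, teacher_feats):
        # (num_tokens, teacher_dim) -> (num_tokens, student_dim)
        return teacher_feats @ self.W

def distill_loss(student_logits, recalibrated_teacher_logits, temperature=2.0):
    """Standard soft-label distillation: KL(teacher || student) over
    temperature-softened distributions, averaged over tokens."""
    p = softmax(recalibrated_teacher_logits / temperature)
    q = softmax(student_logits / temperature)
    return float(np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9))) / p.shape[0])

# Usage: align a 1024-dim teacher to a 512-dim student, then score the gap.
recal = Recalibrator(teacher_dim=1024, student_dim=512)
teacher_feats = np.random.default_rng(1).standard_normal((4, 1024))
student_feats = np.random.default_rng(2).standard_normal((4, 512))
aligned = recal(teacher_feats)          # shape (4, 512)
loss = distill_loss(student_feats, aligned)
```

The key design point this illustrates is that the alignment module, not the student, absorbs the mismatch between architectures: once teacher features are projected into the student's space, any ordinary distillation objective applies, regardless of which LLM backbone either VLM uses.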