

GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

June 18, 2025
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
cs.AI

Abstract

Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types, differing in vocabulary size, token splits, and token index ordering. To address this limitation of being tied to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performance, eventually outperforming large-scale open- and closed-source VLMs.
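The abstract does not include implementation details, but the core idea of the Recalibrator (mapping a large teacher VLM's feature space into a smaller student VLM's feature space so a distillation loss can be computed) can be sketched minimally. Everything below is an assumption for illustration: the function names, the choice of a single linear projection, and the MSE alignment loss are not the authors' actual architecture.

```python
import numpy as np

def recalibrate(teacher_feats, W, b):
    # Hypothetical Recalibrator: a learned linear map projecting teacher
    # hidden states (dim 8 here) into the student's feature space (dim 4).
    # The real GenRecal module is more elaborate; this is only the shape
    # of the idea: align heterogeneous feature spaces before distilling.
    return teacher_feats @ W + b

def alignment_loss(student_feats, recal_feats):
    # Mean-squared error between student features and recalibrated
    # teacher features, a common feature-distillation objective.
    return float(np.mean((student_feats - recal_feats) ** 2))

# Toy setup: sequence of 3 tokens, teacher hidden size 8, student hidden size 4.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(3, 8))   # teacher VLM hidden states
student = rng.normal(size=(3, 4))   # student VLM hidden states
W = rng.normal(size=(8, 4)) * 0.1   # Recalibrator weights (learned in practice)
b = np.zeros(4)                     # Recalibrator bias

recal = recalibrate(teacher, W, b)  # teacher features in student space
loss = alignment_loss(student, recal)
```

In training, `W` and `b` would be optimized jointly with the student so that minimizing the alignment loss transfers the teacher's representations despite the two models using different LLM backbones and tokenizations.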