GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

Abstract

Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge here arises from the diversity of VLM architectures: they are built on different LLMs and employ token types that differ in vocabulary size, token splits, and token index ordering, which restricts distillation to a specific VLM type. To remove this restriction, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performance, ultimately outperforming large-scale open- and closed-source VLMs.
