Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Abstract

Modern Large Vision-Language Models (LVLMs) share the same vision vocabulary, CLIP, which covers most common vision tasks. However, for special vision tasks that need dense and fine-grained perception, such as document-level OCR or chart understanding, and especially in non-English scenarios, a CLIP-style vocabulary may tokenize visual knowledge inefficiently and can even suffer from an out-of-vocabulary problem. Accordingly, we propose Vary, an efficient and effective method for scaling up the vision vocabulary of LVLMs. The procedure divides naturally into two phases: generating a new vision vocabulary and integrating it. In the first phase, we devise a vocabulary network together with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the second, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling the LVLM to pick up new features quickly. Compared with the popular BLIP-2, MiniGPT4, and LLaVA, Vary maintains its vanilla capabilities while gaining stronger fine-grained perception and understanding. Specifically, Vary supports new document parsing features (OCR or markdown conversion) while achieving 78.2% ANLS on DocVQA and 36.2% on MMVet. Our code will be publicly available on the homepage.
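
The DocVQA figure above is reported in ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of what that number measures, the sketch below scores each prediction against its best-matching reference answer and averages the results. The function names are hypothetical, and the 0.5 threshold and case-insensitive comparison are assumed here as the conventional settings; the abstract does not state which evaluation settings were used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic programme)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best per-reference similarity.

    predictions: list of predicted answer strings.
    references:  list of lists of accepted answers, one list per question.
    threshold:   assumed cutoff; normalized distances at or above it score 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()  # assumed case-insensitive matching
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    # One character wrong out of four gives a similarity of 0.75.
    print(anls(["abcd"], [["abcf"]]))
```

Under these assumptions, a prediction that is off by a character or two still earns partial credit, while one that differs from every reference by half its length or more earns none. This is why ANLS suits OCR-heavy benchmarks better than exact match.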
