Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
December 11, 2023
Authors: Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang
cs.AI
Abstract
Modern Large Vision-Language Models (LVLMs) share the same vision vocabulary,
CLIP, which covers most common vision tasks. However, for some special vision
tasks that need dense and fine-grained vision perception, e.g., document-level
OCR or chart understanding, especially in non-English scenarios, the CLIP-style
vocabulary may tokenize the vision knowledge inefficiently and even suffer from
the out-of-vocabulary problem. Accordingly, we propose Vary, an efficient and
effective method to scale up the vision vocabulary of LVLMs. The procedure of
Vary naturally divides into two parts: the generation and the integration of a
new vision vocabulary. In the first phase, we devise a vocabulary network along
with a tiny decoder-only transformer to produce the desired vocabulary via
autoregression. In the second phase, we scale up the vanilla vision vocabulary
by merging the new one with the original one (CLIP), enabling the LVLMs to
quickly garner new features. Compared to the popular BLIP-2, MiniGPT4, and
LLaVA, Vary maintains its vanilla capabilities while offering stronger
fine-grained perception and understanding ability. Specifically, Vary is
competent in new document parsing features (OCR or markdown conversion) while
achieving 78.2% ANLS on DocVQA and 36.2% on MMVet. Our code will be publicly
available on the homepage.
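
The DocVQA figure above is reported in ANLS (Average Normalized Levenshtein Similarity). The following is a minimal sketch of how such a score is commonly computed, assuming the usual 0.5 threshold and case-insensitive comparison. The function names `levenshtein` and `anls` are illustrative and are not taken from the Vary code or from the official DocVQA evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute all cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best per-answer similarity.

    predictions: list of predicted strings, one per question.
    references:  list of lists of accepted ground-truth answers.
    threshold:   assumed cut-off; a normalized distance at or above it scores 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            g = ans.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition, a prediction that differs from a four-character answer by one character scores 0.75 for that question, while a prediction that differs in half or more of its characters scores 0.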