Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
December 11, 2023
Authors: Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang
cs.AI
Abstract
Modern Large Vision-Language Models (LVLMs) share the same vision vocabulary, CLIP, which can cover most common vision tasks. However, for some special vision tasks that need dense and fine-grained vision perception, e.g., document-level OCR or chart understanding, especially in non-English scenarios, the CLIP-style vocabulary may be inefficient at tokenizing the vision knowledge and may even suffer from the out-of-vocabulary problem. Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedure of Vary naturally divides into two parts: the generation and the integration of a new vision vocabulary. In the first phase, we devise a vocabulary network along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the second, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling the LVLM to quickly acquire new features. Compared with the popular BLIP-2, MiniGPT4, and LLaVA, Vary maintains its vanilla capabilities while gaining superior fine-grained perception and understanding ability. Specifically, Vary is competent at new document parsing features (OCR or markdown conversion), achieving 78.2% ANLS on DocVQA and 36.2% on MMVet. Our code will be publicly available on the homepage.
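For a concrete picture of the second stage, the sketch below shows one plausible way to merge a new vision vocabulary with the original CLIP one in PyTorch: both encoders tokenize the image, their token features are concatenated, and a linear layer projects the merged tokens into the LLM's embedding space. All module names, dimensions, and the choice of channel-wise concatenation are assumptions for illustration, not the paper's confirmed implementation.

```python
import torch
import torch.nn as nn


class MergedVisionVocabulary(nn.Module):
    """Minimal sketch of Vary-style vocabulary merging (stage 2).

    A newly trained vision "vocabulary" encoder (from stage 1) is fused
    with the vanilla CLIP encoder so the LLM sees both feature sets.
    Names, dimensions, and the fusion axis are illustrative assumptions.
    """

    def __init__(self, clip_encoder: nn.Module, new_encoder: nn.Module,
                 clip_dim: int = 1024, new_dim: int = 1024,
                 llm_dim: int = 4096):
        super().__init__()
        self.clip_encoder = clip_encoder  # original vocabulary (kept frozen)
        self.new_encoder = new_encoder    # vocabulary network from stage 1
        # Project the concatenated features into the LLM embedding space.
        self.proj = nn.Linear(clip_dim + new_dim, llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Assume each encoder maps an image to a token sequence (B, N, D)
        # with the same token count N.
        clip_tokens = self.clip_encoder(image)   # (B, N, clip_dim)
        new_tokens = self.new_encoder(image)     # (B, N, new_dim)
        # Merge the two vocabularies channel-wise, then project; the
        # result serves as the visual prefix fed to the LLM.
        merged = torch.cat([new_tokens, clip_tokens], dim=-1)
        return self.proj(merged)                 # (B, N, llm_dim)
```

Under these assumptions, the LLM itself is untouched: scaling the vision vocabulary only changes what the visual prefix encodes, which is how the model can gain dense-perception features (OCR, markdown conversion) while keeping its vanilla capabilities.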