Small Language Model Meets with Reinforced Vision Vocabulary
January 23, 2024
Authors: Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, Xiangyu Zhang
cs.AI
Abstract
Playing with Large Vision-Language Models (LVLMs) was trendy in the AI
community in 2023. However, the relatively large parameter counts (more than
7B) of popular LVLMs make them difficult to train and deploy on consumer GPUs,
discouraging many researchers with limited resources. Imagine how cool it
would be to experience all the features of current LVLMs on an old GTX 1080 Ti
(our only gaming card). Accordingly, in this report we present Vary-toy, a
small-size Vary built on Qwen-1.8B as the base "large" language model. In
Vary-toy, we introduce an improved vision vocabulary that allows the model not
only to retain all of Vary's features but also to generalize better.
Specifically, in the procedure of generating the vision vocabulary, we replace
negative samples of natural images with positive sample data driven by object
detection, making fuller use of the vocabulary network's capacity and enabling
it to efficiently encode the visual information corresponding to natural
objects. In experiments, Vary-toy achieves 65.6% ANLS on DocVQA, 59.1%
accuracy on ChartQA, 88.1% accuracy on RefCOCO, and 29% on MMVet. The code
will be publicly available on the homepage.
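
To make the data change concrete, the minimal sketch below shows one plausible
way to build training pairs for the vocabulary network: document and chart
images keep dense OCR-style text targets as in the original Vary, while
natural images get positive targets serialized from object-detection
annotations instead of Vary's negative (meaningless) text. The function names,
sample schema, and serialization format are assumptions for illustration, not
the authors' released code.

```python
# Hypothetical sketch: constructing vision-vocabulary training pairs.
# In Vary, natural images were paired with negative (meaningless) text;
# Vary-toy instead serializes detection annotations into positive targets.

def detection_to_target(labels, boxes):
    """Serialize detection annotations into a text target, e.g.
    ["person", "dog"], [[10, 20, 110, 220], [150, 30, 260, 200]]
    -> "person[10,20,110,220];dog[150,30,260,200]"
    (the exact target format here is an assumption)."""
    return ";".join(
        f"{label}[{','.join(str(c) for c in box)}]"
        for label, box in zip(labels, boxes)
    )

def build_training_pair(sample):
    """Map a raw sample to an (image, target_text) pair for the
    vocabulary network's text-prediction head."""
    if sample["type"] == "document":
        # Document/chart data: dense OCR-style text, as in the original Vary.
        return sample["image"], sample["text"]
    # Natural image: a positive, detection-driven target replaces the
    # negative samples used when generating the original vision vocabulary.
    return sample["image"], detection_to_target(sample["labels"], sample["boxes"])
```

Under this reading, the vocabulary network's capacity is no longer spent
mapping natural images to throwaway text; it learns to encode object identity
and location, which is what the abstract credits for Vary-toy's added
generality on natural-image tasks such as RefCOCO.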