Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Abstract

Modern Large Vision-Language Models (LVLMs) share the same vision vocabulary, CLIP, which covers most common vision tasks. However, for special vision tasks that need dense and fine-grained visual perception, e.g., document-level OCR or chart understanding, and especially in non-English scenarios, a CLIP-style vocabulary may tokenize visual knowledge inefficiently and can even suffer from out-of-vocabulary problems. Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedure of Vary naturally divides into two phases: the generation of a new vision vocabulary and its integration. In the first phase, we devise a vocabulary network along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the second phase, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling the LVLM to quickly acquire new features. Compared to the popular BLIP-2, MiniGPT4, and LLaVA, Vary maintains its vanilla capabilities while offering stronger fine-grained perception and understanding. Specifically, Vary is competent in new document parsing features (OCR or markdown conversion) while achieving 78.2% ANLS on DocVQA and 36.2% on MMVet. Our code will be publicly available on the homepage.
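The DocVQA result above is reported in ANLS (Average Normalized Levenshtein Similarity). As a reading aid, here is a minimal sketch of how that metric is commonly computed, assuming the usual threshold of 0.5 and case-insensitive, whitespace-stripped comparison. The function names `levenshtein` and `anls` are illustrative and are not taken from the Vary code base or from any official evaluation script.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: List[str],
         references: List[List[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best similarity to any accepted answer.

    Assumption: a similarity is zeroed when the normalized distance
    reaches the threshold (0.5 here).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for answer in answers:
            g = answer.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition an exact match scores 1, a near miss earns partial credit, and an answer whose normalized distance is 0.5 or more scores 0, so 78.2% ANLS is a soft accuracy that tolerates small OCR-style character errors.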
