Veagle: Advancements in Multimodal Representation Learning

Abstract

Researchers in artificial intelligence have recently taken a strong interest in how language and vision come together, giving rise to multimodal models that aim to seamlessly integrate textual and visual information. Multimodal models, an extension of Large Language Models (LLMs), have exhibited remarkable capabilities across a diverse array of tasks, ranging from image captioning and visual question answering (VQA) to visual grounding. While these models have made significant advances, challenges persist in accurately interpreting images and answering questions about them, a common requirement in real-world scenarios. This paper introduces a novel approach to enhance the multimodal capabilities of existing models. In response to the limitations observed in current Vision Language Models (VLMs) and Multimodal Large Language Models (MLLMs), our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works. Veagle leverages a dynamic mechanism to project encoded visual information directly into the language model. This dynamic approach allows for a more nuanced understanding of the intricate details present in visual contexts. To validate the effectiveness of Veagle, we conduct comprehensive experiments on benchmark datasets, emphasizing tasks such as visual question answering and image understanding. Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin. The outcomes underscore the model's versatility and applicability beyond traditional benchmarks.
