
Veagle: Advancements in Multimodal Representation Learning

January 18, 2024
Authors: Rajat Chawla, Arkajit Datta, Tushar Verma, Adarsh Jha, Anmol Gautam, Ayush Vatsal, Sukrit Chaterjee, Mukunda NS, Ishaan Bhola
cs.AI

Abstract

Lately, researchers in artificial intelligence have been really interested in how language and vision come together, giving rise to the development of multimodal models that aim to seamlessly integrate textual and visual information. Multimodal models, an extension of Large Language Models (LLMs), have exhibited remarkable capabilities in addressing a diverse array of tasks, ranging from image captioning and visual question answering (VQA) to visual grounding. While these models have showcased significant advancements, challenges persist in accurately interpreting images and answering questions, a common occurrence in real-world scenarios. This paper introduces a novel approach to enhance the multimodal capabilities of existing models. In response to the limitations observed in current Vision Language Models (VLMs) and Multimodal Large Language Models (MLLMs), our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works. Veagle leverages a dynamic mechanism to project encoded visual information directly into the language model. This dynamic approach allows for a more nuanced understanding of intricate details present in visual contexts. To validate the effectiveness of Veagle, we conduct comprehensive experiments on benchmark datasets, emphasizing tasks such as visual question answering and image understanding. Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin. The outcomes underscore the model's versatility and applicability beyond traditional benchmarks.
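
The abstract does not spell out how Veagle's dynamic projection mechanism is implemented. As a rough, hypothetical illustration of the general idea it alludes to, projecting encoded visual features into a language model's embedding space so image tokens can be consumed alongside text tokens, the sketch below defines a generic learned projector in PyTorch. The class name `VisualProjector`, the two-layer MLP design, and all dimensions are assumptions for illustration, not the paper's actual architecture.

```python
# Illustrative sketch only -- NOT the Veagle implementation.
# A generic learned projector that maps vision-encoder features into the
# embedding space of a language model. All names and sizes are hypothetical.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projection, a common choice in recent VLMs.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim) from a vision encoder
        return self.proj(visual_feats)  # (batch, num_patches, llm_dim)

# Usage: projected visual tokens are concatenated with text token embeddings
# before being fed to the language model.
projector = VisualProjector()
visual_feats = torch.randn(2, 257, 1024)   # e.g. ViT patch features
visual_tokens = projector(visual_feats)    # (2, 257, 4096)
text_embeds = torch.randn(2, 32, 4096)     # token embeddings from the LLM
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
```

This pattern (a frozen vision encoder, a trainable projector, and a language model) is the standard way encoded visual information is injected into an LLM; whatever makes Veagle's mechanism "dynamic" beyond this is not detailed in the abstract.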
