An Introduction to Vision-Language Modeling

Abstract

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From visual assistants that could guide us through unfamiliar environments to generative models that produce images from only a high-level text description, applications of vision-language models (VLMs) will significantly impact our relationship with technology. However, many challenges must be addressed to improve the reliability of these models. While language is discrete, vision exists in a much higher-dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs, which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluating VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
