An Introduction to Vision-Language Modeling

Abstract

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From visual assistants that could guide us through unfamiliar environments to generative models that produce images from only a high-level text description, applications of vision-language models (VLMs) will significantly impact our relationship with technology. However, many challenges must be addressed to improve the reliability of these models. While language is discrete, vision evolves in a much higher-dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs, which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
