Language-Informed Visual Concept Learning

Abstract

Our understanding of the visual world is organized around concept axes that characterize different aspects of visual entities. While a concept axis such as color can be easily specified in language, the exact visual nuances along each axis, such as a particular painting style, often exceed what language can articulate. In this work, we aim to learn a language-informed visual concept representation simply by distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with the objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement across the concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along the various axes from new test images; these embeddings can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, the model can also generalize to novel concepts unseen during training.
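The training and remixing ideas described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear encoders, the linear `frozen_generator` standing in for the pre-trained T2I model, the squared-distance anchor term, the `anchor_weight` value, and the axis names `color` and `style` are all assumptions chosen only to make the structure concrete.

```python
import random

random.seed(0)

AXES = ["color", "style"]  # hypothetical concept axes, named after the abstract's examples
IMG_DIM, EMB_DIM = 8, 3    # toy sizes, chosen arbitrarily


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mse(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


# One linear "concept encoder" per axis; in a real setup these are the trainable parts.
encoders = {axis: random_matrix(EMB_DIM, IMG_DIM) for axis in AXES}

# Frozen stand-in for the pre-trained T2I model (an assumption; a real one is a large generator).
frozen_generator = random_matrix(IMG_DIM, EMB_DIM * len(AXES))


def encode(image):
    """Extract one concept embedding per axis from an image vector."""
    return {axis: matvec(encoders[axis], image) for axis in AXES}


def generate(embeddings):
    """Condition the frozen generator on the concatenated concept embeddings."""
    conditioning = [x for axis in AXES for x in embeddings[axis]]
    return matvec(frozen_generator, conditioning)


def training_loss(image, anchors, anchor_weight=0.1):
    """Reconstruction term plus a term pulling each embedding toward its text anchor."""
    embeddings = encode(image)
    reconstruction = mse(generate(embeddings), image)
    anchor = sum(mse(embeddings[a], anchors[a]) for a in AXES) / len(AXES)
    return reconstruction + anchor_weight * anchor


def remix(image_a, image_b, axis_from_b):
    """Keep image_a's concepts except one axis, which is taken from image_b."""
    mixed = encode(image_a)
    mixed[axis_from_b] = encode(image_b)[axis_from_b]
    return generate(mixed)
```

In this sketch, `anchors` plays the role of the per-axis text embeddings that the abstract says come from a pre-trained VQA model; here they would simply be vectors of length `EMB_DIM` supplied by the caller. Calling `remix(image_a, image_b, "style")` mirrors the inference-time recombination: every axis is taken from `image_a` except `style`, which comes from `image_b`.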
