
Language-Informed Visual Concept Learning

December 6, 2023
Authors: Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
cs.AI

Abstract

Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.
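To make the training objective described above more concrete, the following is a minimal sketch of one training step, assuming a PyTorch-style setup. The names (`ConceptEncoder`, `t2i_denoiser`, `vqa_text_embeds`, `anchor_weight`) and the simple MSE anchoring loss are illustrative assumptions, not the paper's actual implementation; in practice the reconstruction term would be a diffusion denoising loss from the frozen pre-trained T2I model, and the anchor targets would be text embeddings of per-axis answers produced by the frozen VQA model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptEncoder(nn.Module):
    """Hypothetical encoder mapping an image to one concept embedding (e.g. for the 'color' axis)."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.backbone(images)


def training_step(images, encoders, t2i_denoiser, vqa_text_embeds, anchor_weight=0.1):
    """One (assumed) training step.

    images:          (B, 3, H, W) input images
    encoders:        dict {axis_name: ConceptEncoder}, one per concept axis
    t2i_denoiser:    frozen text-to-image model wrapped as a callable returning a
                     reconstruction/denoising loss given image-conditioning pairs
    vqa_text_embeds: dict {axis_name: (B, D)} text embeddings of per-axis answers
                     from a frozen VQA model (precomputed)
    """
    # 1. Extract one embedding per language-informed concept axis from the image.
    concept_embeds = {axis: enc(images) for axis, enc in encoders.items()}

    # 2. Reconstruction objective: the frozen T2I model should reproduce the
    #    input image when conditioned on the concatenated concept embeddings.
    cond = torch.stack(list(concept_embeds.values()), dim=1)  # (B, num_axes, D)
    recon_loss = t2i_denoiser(images, cond)

    # 3. Anchoring objective: keep each concept embedding close to the
    #    corresponding VQA-derived text embedding, which encourages the
    #    encoders to stay disentangled across axes.
    anchor_loss = sum(
        F.mse_loss(concept_embeds[axis], vqa_text_embeds[axis])
        for axis in encoders
    )

    return recon_loss + anchor_weight * anchor_loss
```

At inference, the same encoders would extract per-axis embeddings from new test images; remixing embeddings across images (e.g. the color embedding from one image with the style embedding from another) and feeding them to the T2I model yields novel concept compositions.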