Language-Informed Visual Concept Learning

December 6, 2023
Authors: Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
cs.AI

Abstract

Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.
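To make the training objective concrete, below is a minimal PyTorch-style sketch of the two losses the abstract describes: a reconstruction loss through a frozen text-to-image model and an anchoring loss toward VQA-derived text embeddings. All module names, dimensions, the `frozen_t2i` interface, and the `anchor_weight` hyperparameter are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the training objective described in the abstract.
# All names, shapes, and loss weights are illustrative assumptions;
# the frozen T2I and VQA models are treated as black boxes.
import torch.nn as nn
import torch.nn.functional as F

class ConceptEncoder(nn.Module):
    """Maps image features to an embedding for one concept axis (e.g. color)."""
    def __init__(self, feature_dim=768, concept_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feature_dim, concept_dim),
            nn.ReLU(),
            nn.Linear(concept_dim, concept_dim),
        )

    def forward(self, image_features):
        return self.proj(image_features)

def training_step(encoders, image_features, image, frozen_t2i,
                  vqa_text_embeds, anchor_weight=0.1):
    """One optimization step over the concept encoders (hypothetical API).

    encoders:        dict mapping axis name (e.g. "color") -> ConceptEncoder
    frozen_t2i:      callable rendering an image from concept embeddings
    vqa_text_embeds: dict mapping axis name -> text embedding of the VQA
                     answer for that axis (e.g. "What color is it?" -> "red")
    """
    # One embedding per language-informed concept axis.
    concept_embeds = {axis: enc(image_features) for axis, enc in encoders.items()}

    # Reconstruction objective: condition the frozen T2I model on the
    # concept embeddings and require it to reproduce the input image.
    recon = frozen_t2i(list(concept_embeds.values()))
    loss_recon = F.mse_loss(recon, image)

    # Anchoring objective: pull each axis embedding toward its VQA-derived
    # text embedding, encouraging disentanglement across encoders.
    loss_anchor = sum(
        1.0 - F.cosine_similarity(concept_embeds[a], vqa_text_embeds[a], dim=-1).mean()
        for a in concept_embeds
    )
    return loss_recon + anchor_weight * loss_anchor
```

In this reading, the "remixing" at inference time amounts to swapping entries of `concept_embeds` extracted from different test images (say, the color embedding from one image and the style embedding from another) before calling the T2I model.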