Language-Informed Visual Concept Learning

Abstract

Our understanding of the visual world is centered around various concept axes that characterize different aspects of visual entities. While different concept axes can be easily specified by language (e.g., color), the exact visual nuances along each axis often exceed the limits of linguistic articulation (e.g., a particular style of painting). In this work, our goal is to learn a language-informed visual concept representation by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with the objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of the different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen during training.
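The remixing and anchoring ideas described above can be illustrated with a small sketch. This is not the authors' implementation: the axis names, the dictionary layout, the helper names `remix` and `anchor_loss`, and the choice of a mean-squared-error anchor term are all assumptions made for illustration. The abstract does not specify the form of the anchoring loss, and a real system would obtain the embeddings from learned concept encoders and pass the remixed set to a T2I model.

```python
from typing import Dict, List

Embedding = List[float]
ConceptSet = Dict[str, Embedding]  # hypothetical layout: axis name -> embedding


def remix(base: ConceptSet, donor: ConceptSet, axes: List[str]) -> ConceptSet:
    """Return a copy of `base` whose embeddings on `axes` are taken from `donor`."""
    missing = [a for a in axes if a not in base or a not in donor]
    if missing:
        raise KeyError(f"axes not present in both concept sets: {missing}")
    # Copy every vector so the inputs are never mutated.
    mixed = {axis: list(vec) for axis, vec in base.items()}
    for axis in axes:
        mixed[axis] = list(donor[axis])
    return mixed


def anchor_loss(concept: Embedding, text_anchor: Embedding) -> float:
    """Mean squared distance between a concept embedding and its text anchor.

    Assumption: MSE is used here only as a stand-in for whatever anchoring
    term the method actually uses.
    """
    if len(concept) != len(text_anchor):
        raise ValueError("embedding and anchor must have the same length")
    return sum((c - t) ** 2 for c, t in zip(concept, text_anchor)) / len(concept)


# Toy embeddings standing in for encoder outputs on two test images.
image_a = {"color": [1.0, 0.0], "style": [0.5, 0.5]}
image_b = {"color": [0.0, 1.0], "style": [0.2, 0.8]}

# Keep the style of image A but borrow the color concept from image B.
novel = remix(image_a, image_b, ["color"])
```

In this sketch each axis is stored separately, so swapping one axis leaves the others untouched. That separation is the property the anchoring term is meant to encourage in the learned encoders.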
