Learning the Visualness of Text Using Large Vision-Language Models

Abstract

Visual text evokes an image in a person's mind, while non-visual text fails to do so. A method that automatically detects visualness in text would make it possible to augment text with relevant images, because neural text-to-image generation and retrieval models operate on the implicit assumption that the input text is visual in nature. We curate a dataset of 3,620 English sentences and their visualness scores provided by multiple human annotators. Additionally, we use documents that contain text and visual assets to create a distantly supervised corpus of document text and associated images. We also propose a fine-tuning strategy that adapts large vision-language models like CLIP, which assume a one-to-one correspondence between text and image, to the task of scoring text visualness from text input alone. Our strategy modifies the model's contrastive learning objective to map text identified as non-visual to a common NULL image while matching visual text to its corresponding image in the document. We evaluate the proposed approach on its ability to (i) classify visual and non-visual text accurately, and (ii) attend to words that are identified as visual in psycholinguistic studies. Empirical evaluation indicates that our approach performs better than several heuristics and baseline models on the proposed task. Furthermore, to highlight the importance of modeling the visualness of text, we conduct qualitative analyses of text-to-image generation systems like DALL-E.
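To make the modified contrastive objective more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes precomputed text and image embeddings. Every name in it (`null_image_contrastive_loss`, `visualness_score`, `null_emb`, the `temperature` default) is hypothetical. Using soft targets so that all non-visual texts in a batch share the NULL image is one plausible way to relax the one-to-one assumption, and scoring visualness as the negative similarity to the NULL embedding is likewise an assumption rather than something stated in the abstract.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def null_image_contrastive_loss(text_emb, image_emb, is_visual, null_emb,
                                temperature=0.07):
    """Symmetric contrastive loss in which non-visual texts share one NULL image.

    text_emb:  (N, d) text embeddings
    image_emb: (N, d) paired image embeddings (rows of non-visual texts are ignored)
    is_visual: (N,) boolean mask, True where the text is labeled visual
    null_emb:  (d,) embedding of the shared NULL image (assumed to be given)
    """
    is_visual = np.asarray(is_visual, dtype=bool)
    # Non-visual texts are paired with the NULL image instead of a document image.
    images = np.where(is_visual[:, None], image_emb, null_emb[None, :])
    logits = l2_normalize(text_emb) @ l2_normalize(images).T / temperature

    # Target matrix: each text matches its own image; in addition, every pair of
    # non-visual texts matches, because they all point at the same NULL image.
    non_visual = ~is_visual
    targets = (np.eye(len(is_visual)) + np.outer(non_visual, non_visual)) > 0
    targets = targets / targets.sum(axis=1, keepdims=True)  # rows sum to 1

    # Soft-target cross-entropy in both directions (targets are symmetric).
    text_to_image = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    image_to_text = -(targets * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (text_to_image + image_to_text)


def visualness_score(text_emb, null_emb):
    """Hypothetical text-only score: lower similarity to NULL means more visual."""
    return -(l2_normalize(text_emb) @ l2_normalize(null_emb))
```

At inference time only the text encoder and the NULL embedding are needed under this sketch, which is in keeping with the stated goal of scoring visualness from text input alone.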
