VCR: Visual Caption Restoration

Abstract

We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured text using pixel-level hints within images. The task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language, because it requires aligning three modalities: vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition (OCR) or masked language modeling, which reduces the task to mainly text-based processing. In VCR, however, text-based processing becomes ineffective, because accurate text restoration depends on the combined information from the provided image, the context, and subtle cues from the tiny exposed areas of the masked text. We develop a pipeline that generates synthetic images for the VCR task from image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct a dataset for VCR called VCR-Wiki using captioned images from Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and hard split variants. Our results reveal that current vision-language models significantly lag behind human performance on the VCR task, and that merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-Wiki and the data construction code to facilitate future research.
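As a rough illustration of what "adjustable caption visibility" could mean, the sketch below covers the middle of a rendered text line and leaves a configurable number of pixel rows exposed at the top and bottom. It is only a toy model under stated assumptions: the binary-grid image representation, the function names `mask_text_band` and `visible_fraction`, and the `exposed_rows` parameter are all hypothetical and are not taken from the released data construction code.

```python
from typing import List

Pixel = int  # assumption: toy binary image, 0 = background, 1 = ink


def mask_text_band(rows: List[List[Pixel]], exposed_rows: int,
                   fill: Pixel = 0) -> List[List[Pixel]]:
    """Cover the middle of a rendered text line.

    Keeps `exposed_rows` pixel rows visible at the top and at the bottom
    and overwrites every other row with `fill`. A smaller `exposed_rows`
    means fewer visual hints, i.e. a harder (hypothetical) task setting.
    """
    if exposed_rows < 0:
        raise ValueError("exposed_rows must be non-negative")
    height = len(rows)
    masked = []
    for y, row in enumerate(rows):
        visible = y < exposed_rows or y >= height - exposed_rows
        masked.append(list(row) if visible else [fill] * len(row))
    return masked


def visible_fraction(rows: List[List[Pixel]], exposed_rows: int) -> float:
    """Fraction of the text line's rows that stay visible after masking."""
    height = len(rows)
    if height == 0:
        return 0.0
    return min(1.0, 2 * exposed_rows / height)


if __name__ == "__main__":
    # A 6x4 block of "ink" standing in for one rendered text line.
    text_line = [[1, 1, 1, 1] for _ in range(6)]
    for row in mask_text_band(text_line, exposed_rows=1):
        print(row)
```

In this toy setup, lowering `exposed_rows` plays the role of moving from an easier to a harder variant; the actual visibility settings used for the easy and hard splits of VCR-Wiki are not specified in the abstract.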
