VCR: Visual Caption Restoration
June 10, 2024
Authors: Tianyu Zhang, Suyuchen Wang, Lu Li, Ge Zhang, Perouz Taslakian, Sai Rajeswar, Jie Fu, Bang Liu, Yoshua Bengio
cs.AI
Abstract
We introduce Visual Caption Restoration (VCR), a novel vision-language task
that challenges models to accurately restore partially obscured texts using
pixel-level hints within images. This task stems from the observation that text
embedded in images is intrinsically different from common visual elements and
natural language due to the need to align the modalities of vision, text, and
text embedded in images. While numerous works have integrated text embedded in
images into visual question-answering tasks, approaches to these tasks
generally rely on optical character recognition or masked language modeling,
thus reducing the task to mainly text-based processing. However, text-based
processing becomes ineffective in VCR as accurate text restoration depends on
the combined information from provided images, context, and subtle cues from
the tiny exposed areas of masked texts. We develop a pipeline to generate
synthetic images for the VCR task using image-caption pairs, with adjustable
caption visibility to control the task difficulty. With this pipeline, we
construct a dataset for VCR called VCR-Wiki using images with captions from
Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and
hard split variants. Our results reveal that current vision-language models
significantly lag behind human performance in the VCR task, and merely
fine-tuning the models on our dataset does not lead to notable improvements. We
release VCR-Wiki and the data construction code to facilitate future research.
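The pipeline above controls task difficulty by adjusting how much of the masked caption remains visible at the pixel level. As a minimal toy sketch (the function name, span format, and the 0/1 grid stand-in for a rendered caption image are all hypothetical; the actual pipeline renders real caption text into images), the masking step might look like:

```python
# Toy illustration of adjustable caption visibility (hypothetical, not the
# authors' code): a "rendered caption" is a 2D grid of 0/1 pixels, and
# masking a text span erases everything below the top `visible_rows` pixel
# rows, leaving only a thin pixel-level hint of the original text.

def mask_caption(pixels, spans, visible_rows):
    """Return a copy of `pixels` with each (start_col, end_col) span in
    `spans` obscured below the first `visible_rows` rows."""
    height = len(pixels)
    masked = [row[:] for row in pixels]   # copy so the original is untouched
    for start, end in spans:
        for r in range(visible_rows, height):   # rows below the hint strip
            for c in range(start, end):
                masked[r][c] = 0                # erase ink to background
    return masked

# A fake 4-row "rendering" of a short caption (1 = ink, 0 = background).
caption = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 0, 1, 0],
]

# Easier variant: 2 of 4 rows stay visible; harder: only the top row survives.
easy = mask_caption(caption, spans=[(3, 8)], visible_rows=2)
hard = mask_caption(caption, spans=[(3, 8)], visible_rows=1)
```

Shrinking `visible_rows` forces a model to rely increasingly on the surrounding image and textual context rather than on the exposed glyph fragments, which is the knob the easy/hard splits of VCR-Wiki turn.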