VCR: Visual Caption Restoration

June 10, 2024
Authors: Tianyu Zhang, Suyuchen Wang, Lu Li, Ge Zhang, Perouz Taslakian, Sai Rajeswar, Jie Fu, Bang Liu, Yoshua Bengio
cs.AI

Abstract

We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct a dataset for VCR called VCR-Wiki using images with captions from Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and hard split variants. Our results reveal that current vision language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-Wiki and the data construction code to facilitate future research.
