Guiding Image Captioning Models Toward More Specific Captions
July 31, 2023
Authors: Simon Kornblith, Lala Li, Zirui Wang, Thao Nguyen
cs.AI
Abstract
Image captioning is conventionally formulated as the task of generating
captions for images that match the distribution of reference image-caption
pairs. However, reference captions in standard captioning datasets are short
and may not uniquely identify the images they describe. These problems are
further exacerbated when models are trained directly on image-alt text pairs
collected from the internet. In this work, we show that it is possible to
generate more specific captions with minimal changes to the training process.
We implement classifier-free guidance for an autoregressive captioning model by
fine-tuning it to estimate both conditional and unconditional distributions
over captions. The guidance scale applied at decoding controls a trade-off
between maximizing p(caption|image) and
p(image|caption). Compared to standard greedy decoding,
decoding with a guidance scale of 2 substantially improves reference-free
metrics such as CLIPScore (0.808 vs. 0.775) and caption-to-image retrieval
performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens
standard reference-based captioning metrics (e.g., CIDEr 78.6 vs. 126.1). We
further explore the use of language models to guide the decoding process,
obtaining small improvements over the Pareto frontier of reference-free vs.
reference-based captioning metrics that arises from classifier-free guidance,
and substantially improving the quality of captions generated from a model
trained only on minimally curated web data.
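
The classifier-free guidance described in the abstract has a simple decoding-time form: the same fine-tuned model is queried once with the image and once without it, and the two sets of next-token logits are combined as logits_uncond + scale * (logits_cond - logits_uncond), so that a guidance scale of 1 recovers ordinary conditional decoding and a scale of 2 matches the setting reported above. The PyTorch snippet below is a minimal, hypothetical sketch of guided greedy decoding, not the paper's code; the model call signature (model(tokens, image=...) returning next-token logits, with image=None giving the unconditional distribution learned during fine-tuning) is an assumption for illustration.

    import torch

    @torch.no_grad()
    def cfg_greedy_decode(model, image, bos_id, eos_id,
                          max_len=32, guidance_scale=2.0):
        """Greedy decoding with classifier-free guidance (illustrative sketch).

        Assumes `model(tokens, image=...)` returns next-token logits of shape
        [1, vocab_size], and that `image=None` yields the unconditional
        distribution p(caption) estimated by the fine-tuned model.
        """
        tokens = torch.tensor([[bos_id]])
        for _ in range(max_len):
            logits_cond = model(tokens, image=image)    # log-odds for p(token | prefix, image)
            logits_uncond = model(tokens, image=None)   # log-odds for p(token | prefix)
            # Classifier-free guidance: push the prediction away from the
            # unconditional distribution; scale = 1 is ordinary decoding.
            logits = logits_uncond + guidance_scale * (logits_cond - logits_uncond)
            next_token = logits.argmax(dim=-1, keepdim=True)
            tokens = torch.cat([tokens, next_token], dim=-1)
            if next_token.item() == eos_id:
                break
        return tokens

Larger guidance scales favor tokens whose conditional probability exceeds their unconditional probability, which is why the abstract reports more image-specific captions (higher CLIPScore and retrieval recall) at scale 2 at the cost of reference-based metrics such as CIDEr.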