Guiding Image Captioning Models Toward More Specific Captions
July 31, 2023
Authors: Simon Kornblith, Lala Li, Zirui Wang, Thao Nguyen
cs.AI
Abstract
Image captioning is conventionally formulated as the task of generating
captions for images that match the distribution of reference image-caption
pairs. However, reference captions in standard captioning datasets are short
and may not uniquely identify the images they describe. These problems are
further exacerbated when models are trained directly on image-alt text pairs
collected from the internet. In this work, we show that it is possible to
generate more specific captions with minimal changes to the training process.
We implement classifier-free guidance for an autoregressive captioning model by
fine-tuning it to estimate both conditional and unconditional distributions
over captions. The guidance scale applied at decoding controls a trade-off
between maximizing p(caption|image) and
p(image|caption). Compared to standard greedy decoding,
decoding with a guidance scale of 2 substantially improves reference-free
metrics such as CLIPScore (0.808 vs. 0.775) and caption-to-image retrieval
performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens
standard reference-based captioning metrics (e.g., CIDEr 78.6 vs. 126.1). We
further explore the use of language models to guide the decoding process,
obtaining small improvements over the Pareto frontier of reference-free vs.
reference-based captioning metrics that arises from classifier-free guidance,
and substantially improving the quality of captions generated from a model
trained only on minimally curated web data.
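The decoding rule described above can be made concrete with a short sketch. The snippet below is a minimal illustration, not the authors' implementation, of classifier-free guidance applied per token during greedy decoding; `cfg_greedy_decode`, `cond_logits_fn`, and `uncond_logits_fn` are hypothetical names standing in for the fine-tuned captioner evaluated with and without the image.

```python
import numpy as np

def cfg_greedy_decode(cond_logits_fn, uncond_logits_fn, bos_id, eos_id,
                      guidance_scale=2.0, max_len=32):
    """Greedy decoding with classifier-free guidance (illustrative sketch).

    cond_logits_fn(tokens)   -> next-token logits given the image (conditional)
    uncond_logits_fn(tokens) -> next-token logits with the image dropped (unconditional)
    """
    tokens = [bos_id]
    for _ in range(max_len):
        cond = np.asarray(cond_logits_fn(tokens))      # ~ log p(token | prefix, image)
        uncond = np.asarray(uncond_logits_fn(tokens))  # ~ log p(token | prefix)
        # Guided logits: scale = 1 recovers ordinary conditional decoding;
        # scale > 1 extrapolates away from the unconditional distribution,
        # up-weighting tokens that are specific to the image.
        guided = uncond + guidance_scale * (cond - uncond)
        next_id = int(np.argmax(guided))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```

With guidance_scale = 2 this corresponds to the setting reported in the abstract; per the paper, both branches come from the same captioning model fine-tuned to estimate the conditional and unconditional caption distributions (e.g., by dropping the image input for the unconditional pass).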