Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning
June 4, 2024
Authors: Wenyan Li, Jiaang Li, Rita Ramos, Raphael Tang, Desmond Elliott
cs.AI
Abstract
Recent advances in retrieval-augmented models for image captioning highlight
the benefit of retrieving related captions for efficient, lightweight models
with strong domain-transfer capabilities. While these models demonstrate the
success of retrieval augmentation, retrieval models are still far from perfect
in practice: the retrieved information can sometimes mislead the model,
resulting in incorrect generation and worse performance. In this paper, we
analyze the robustness of SmallCap, a retrieval-augmented captioning model. Our
analysis shows that the model is sensitive to tokens that appear in the
majority of the retrieved captions, and the input attribution shows that those
tokens are likely copied into the generated output. Given these findings, we
propose to train the model by sampling retrieved captions from more diverse
sets. This decreases the chance that the model learns to copy majority tokens,
and improves both in-domain and cross-domain performance.
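The proposed fix is to diversify the retrieval context seen during training: rather than always conditioning on the same top-k retrieved captions, sample k captions from a larger retrieved pool at each training step. A minimal sketch of this idea is below; the function name, pool size, and k are illustrative assumptions, not the paper's actual implementation.

```python
import random

def sample_retrieved_captions(retrieved, k=4, pool_size=16, rng=None):
    """Hypothetical helper: sample k captions from a larger retrieved pool.

    Instead of always using the top-k retrieved captions, draw k at
    random from the top-`pool_size` results. This varies the retrieval
    context across training steps, reducing the chance that the model
    learns to copy tokens shared by the majority of retrieved captions.
    """
    rng = rng or random.Random()
    pool = retrieved[:pool_size]  # restrict to the most relevant results
    return rng.sample(pool, min(k, len(pool)))

# Example: retrieve 16 candidate captions, sample 4 for this step
retrieved = [f"caption {i}" for i in range(16)]
subset = sample_retrieved_captions(retrieved, k=4, rng=random.Random(0))
```

At inference time the model can still condition on the standard top-k results; the sampling is only applied during training to decorrelate the prompt from any fixed set of majority tokens.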