Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning

Abstract

Recent advances in retrieval-augmented models for image captioning highlight the benefit of retrieving related captions: it enables efficient, lightweight models with strong domain-transfer capabilities. While these models demonstrate the success of retrieval augmentation, retrieval models are still far from perfect in practice. Retrieved information can mislead the model, resulting in incorrect generation and degraded performance. In this paper, we analyze the robustness of the retrieval-augmented captioning model SmallCap. Our analysis shows that the model is sensitive to tokens that appear in the majority of the retrieved captions, and input attribution shows that those tokens are likely to be copied into the generated output. Given these findings, we propose to train the model by sampling retrieved captions from more diverse sets. This decreases the chance that the model learns to copy majority tokens, and improves both in-domain and cross-domain performance.
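As a rough illustration of the two ideas in the abstract, the sketch below shows (a) one way to find tokens that occur in the majority of a set of retrieved captions and (b) one way to sample training captions from a larger retrieved pool instead of always using the top-ranked ones. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, whitespace tokenization, the "more than half" threshold, and the default values `k=4` and `pool_size=7` are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def majority_tokens(captions):
    """Return tokens that occur in more than half of the given captions.

    Assumption: simple lowercase whitespace tokenization; a real system
    would use the captioning model's own tokenizer.
    """
    counts = Counter()
    for caption in captions:
        # Count each token at most once per caption.
        counts.update(set(caption.lower().split()))
    threshold = len(captions) / 2
    return {token for token, count in counts.items() if count > threshold}


def sample_retrieved_captions(ranked_captions, k=4, pool_size=7, rng=None):
    """Sample k captions from the top `pool_size` retrieved captions.

    Assumption: `ranked_captions` is ordered from most to least similar.
    Sampling from a wider pool (rather than always taking the top k)
    exposes the model to more varied retrieved context during training.
    """
    rng = rng or random.Random()
    pool = ranked_captions[:pool_size]
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    retrieved = [
        "a dog runs on grass",
        "a dog plays outside",
        "a cat sleeps on a sofa",
        "a dog jumps over a log",
        "two birds fly over water",
        "a horse stands in a field",
        "a man walks a dog",
    ]
    print(majority_tokens(retrieved[:4]))
    print(sample_retrieved_captions(retrieved, rng=random.Random(0)))
```

In this sketch, the sampled captions would replace the fixed top-k captions in the training prompt, while `majority_tokens` could be used to check how often generated words coincide with tokens shared by most retrieved captions.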
