Guiding Image Captioning Models Toward More Specific Captions

Abstract

Image captioning is conventionally formulated as the task of generating captions for images that match the distribution of reference image-caption pairs. However, reference captions in standard captioning datasets are short and may not uniquely identify the images they describe. These problems are exacerbated when models are trained directly on image and alt-text pairs collected from the internet. In this work, we show that it is possible to generate more specific captions with minimal changes to the training process. We implement classifier-free guidance for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions. The guidance scale applied at decoding controls a trade-off between maximizing p(caption|image) and p(image|caption). Compared to standard greedy decoding, decoding with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore (0.808 vs. 0.775) and caption-to-image retrieval performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens standard reference-based captioning metrics (e.g., CIDEr 78.6 vs. 126.1). We further explore the use of language models to guide the decoding process, obtaining small improvements over the Pareto frontier of reference-free vs. reference-based captioning metrics that arises from classifier-free guidance, and substantially improving the quality of captions generated from a model trained only on minimally curated web data.
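To make the role of the guidance scale concrete, the following is a minimal sketch of one greedy decoding step under classifier-free guidance. The abstract does not state the exact formula, so this assumes the commonly used combination, in which the guided score is the unconditional log-probability plus the scale times the difference between the conditional and unconditional log-probabilities. The function names and the toy three-token probabilities are hypothetical and chosen only for illustration.

```python
import math

def log_softmax(logits):
    """Normalize raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def guided_scores(cond_logits, uncond_logits, scale):
    """Assumed guidance rule: log p(tok) + scale * (log p(tok|image) - log p(tok))."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def greedy_token(cond_logits, uncond_logits, scale):
    """Pick the index of the highest guided score (one greedy decoding step)."""
    scores = guided_scores(cond_logits, uncond_logits, scale)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
# Token 0 is generic: likely both with and without the image.
# Token 1 is image-specific: much more likely once the image is given.
cond = [math.log(p) for p in (0.5, 0.4, 0.1)]    # p(token | image, prefix)
uncond = [math.log(p) for p in (0.7, 0.2, 0.1)]  # p(token | prefix)

print(greedy_token(cond, uncond, scale=1.0))
print(greedy_token(cond, uncond, scale=2.0))
```

With a scale of 1 the guided scores equal the conditional log-probabilities, so this reduces to ordinary greedy decoding and selects the generic token 0. With a scale of 2 the ratio p(token|image)/p(token) is weighted more heavily and the image-specific token 1 is selected. By Bayes' rule that ratio is proportional to p(image|token), which is one way to read the trade-off between p(caption|image) and p(image|caption) described above.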
