RONA: Pragmatically Diverse Image Captioning with Coherence Relations

Abstract

Writing assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by applying syntactic and semantic variations to descriptions of image components. Human-written captions, however, prioritize conveying a central message alongside visual descriptions, and they rely on pragmatic cues to do so. Enhancing pragmatic diversity therefore requires exploring alternative ways of communicating these messages in conjunction with the visual content. To address this challenge, we propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLMs) that leverages Coherence Relations as an axis for variation. We demonstrate that, across multiple domains, RONA generates captions with better overall diversity and ground-truth alignment than MLLM baselines. Our code is available at: https://github.com/aashish2000/RONA
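As a rough illustration of what using coherence relations as an axis for variation could look like in a prompt, the sketch below builds one caption prompt per relation. It is not taken from the RONA repository: the relation names, their glosses, the prompt wording, and the helper `build_prompts` are all assumptions made for illustration, since the abstract does not list the relations or the prompt template that RONA uses.

```python
from typing import Dict, List

# Hypothetical relation names and glosses, for illustration only.
# The abstract does not specify which coherence relations RONA uses.
RELATIONS: Dict[str, str] = {
    "Restatement": "restate what is directly visible in the image",
    "Elaboration": "add detail that goes beyond what is directly visible",
}


def build_prompts(message: str, relations: Dict[str, str] = RELATIONS) -> List[str]:
    """Return one caption prompt per coherence relation.

    Each prompt keeps the central message fixed and varies only the
    relation between the caption and the image.
    """
    prompts = []
    for name, gloss in relations.items():
        prompts.append(
            "Write a caption for the attached image that conveys the message "
            f"'{message}'. The caption should relate to the image by "
            f"{name}: {gloss}."
        )
    return prompts


if __name__ == "__main__":
    # Print the prompts that would be sent, one per relation, to an MLLM.
    for prompt in build_prompts("Local bakery reopens after renovation"):
        print(prompt)
```

Each resulting prompt would be sent to an MLLM together with the image; holding the message fixed while changing the relation is what is meant to make the variation pragmatic rather than purely syntactic or semantic.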

