LaMP-Cap: Personalized Figure Caption Generation with Multimodal Figure Profiles

Abstract

Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose higher-quality captions more easily. Yet authors almost always need to revise generic AI-generated captions to match their own writing style and that of their domain, highlighting the need for personalization. Despite advances in language model personalization (LaMP), existing techniques often focus on text-only settings and rarely address scenarios where both the inputs and the profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as the figure image, but also up to three other figures from the same document (each with its image, caption, and figure-mentioning paragraphs) as a profile that characterizes the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of multimodal profiles over text-only ones.
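
The profile structure described above can be pictured with a minimal Python sketch. This is an illustration only, not the dataset's actual schema: the class names `ProfileFigure` and `CaptionTask`, all field names, and the constant `MAX_PROFILE_FIGURES` are hypothetical, and only the target image is modeled among the "needed inputs" because the abstract names no others.

```python
from dataclasses import dataclass, field
from typing import List

# The abstract states that a profile holds "up to three" other figures.
MAX_PROFILE_FIGURES = 3


@dataclass
class ProfileFigure:
    """One other figure from the same document (hypothetical field names)."""
    image_path: str
    caption: str
    mention_paragraphs: List[str] = field(default_factory=list)


@dataclass
class CaptionTask:
    """A target figure plus its multimodal profile (hypothetical layout)."""
    target_image_path: str
    profile: List[ProfileFigure] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the "up to three" limit stated in the abstract.
        if len(self.profile) > MAX_PROFILE_FIGURES:
            raise ValueError(
                f"profile holds {len(self.profile)} figures; "
                f"at most {MAX_PROFILE_FIGURES} are allowed"
            )

    def profile_captions(self) -> List[str]:
        """Return the author-written captions available as style examples."""
        return [fig.caption for fig in self.profile]
```

Under these assumptions, a text-only profile would drop `image_path` from each `ProfileFigure`, which is the kind of contrast the ablation studies draw.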