LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
June 6, 2025
Authors: Ho Yin 'Sam' Ng, Ting-Yao Hsu, Aashish Anantha Ramakrishnan, Branislav Kveton, Nedim Lipka, Franck Dernoncourt, Dongwon Lee, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ting-Hao 'Kenneth' Huang
cs.AI
Abstract
Figure captions are crucial for helping readers understand and remember a
figure's key message. Many models have been developed to generate these
captions, helping authors compose better quality captions more easily. Yet,
authors almost always need to revise generic AI-generated captions to match
their writing style and the domain's style, highlighting the need for
personalization. Despite advances in language model personalization (LaMP),
these techniques typically focus on text-only settings and rarely address
scenarios where both inputs and profiles are multimodal. This paper introduces
LaMP-Cap, a dataset for personalized figure caption generation with multimodal
figure profiles. For each target figure, LaMP-Cap provides not only the needed
inputs, such as figure images, but also up to three other figures from the same
document--each with its image, caption, and figure-mentioning paragraphs--as a
profile to characterize the context. Experiments with four LLMs show that using
profile information consistently helps generate captions closer to the original
author-written ones. Ablation studies reveal that images in the profile are
more helpful than figure-mentioning paragraphs, highlighting the advantage of
using multimodal profiles over text-only ones.
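To make the dataset structure concrete, the sketch below models one LaMP-Cap example as described in the abstract: a target figure plus up to three profile figures from the same document, each carrying its image, caption, and figure-mentioning paragraphs. All class and function names here are hypothetical illustrations, not the dataset's actual schema; image inputs are represented as file paths, since images would be attached separately in a multimodal LLM call.

```python
from dataclasses import dataclass, field

@dataclass
class ProfileFigure:
    # One auxiliary figure from the same document (hypothetical schema):
    # its image, author-written caption, and the paragraphs mentioning it.
    image_path: str
    caption: str
    mentioning_paragraphs: list

@dataclass
class LaMPCapExample:
    # The target figure to be captioned, plus up to three profile figures.
    target_image_path: str
    target_mentioning_paragraphs: list
    profile: list = field(default_factory=list)

def build_prompt(example: LaMPCapExample) -> str:
    """Assemble the text portion of a personalized captioning prompt.

    Profile captions and paragraphs characterize the authors' style;
    profile and target images would be passed alongside this text to a
    multimodal model. This assembly format is an assumption for
    illustration, not the paper's exact prompt.
    """
    parts = ["Write a caption for the target figure in the authors' style."]
    for i, fig in enumerate(example.profile[:3], start=1):
        parts.append(f"Profile figure {i} caption: {fig.caption}")
        for para in fig.mentioning_paragraphs:
            parts.append(f"Profile figure {i} context: {para}")
    for para in example.target_mentioning_paragraphs:
        parts.append(f"Target figure context: {para}")
    return "\n".join(parts)
```

A profile-free baseline would simply pass an empty `profile` list, which mirrors the abstract's comparison between generic and profile-conditioned caption generation.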