LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
June 6, 2025
Authors: Ho Yin 'Sam' Ng, Ting-Yao Hsu, Aashish Anantha Ramakrishnan, Branislav Kveton, Nedim Lipka, Franck Dernoncourt, Dongwon Lee, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ting-Hao 'Kenneth' Huang
cs.AI
Abstract
Figure captions are crucial for helping readers understand and remember a
figure's key message. Many models have been developed to generate these
captions, helping authors compose better-quality captions more easily. Yet,
authors almost always need to revise generic AI-generated captions to match
their writing style and the domain's style, highlighting the need for
personalization. Despite advances in language model personalization (LaMP),
these technologies often focus on text-only settings and rarely address
scenarios where both inputs and profiles are multimodal. This paper introduces
LaMP-Cap, a dataset for personalized figure caption generation with multimodal
figure profiles. For each target figure, LaMP-Cap provides not only the needed
inputs, such as figure images, but also up to three other figures from the same
document--each with its image, caption, and figure-mentioning paragraphs--as a
profile to characterize the context. Experiments with four LLMs show that using
profile information consistently helps generate captions closer to the original
author-written ones. Ablation studies reveal that images in the profile are
more helpful than figure-mentioning paragraphs, highlighting the advantage of
using multimodal profiles over text-only ones.
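The dataset structure described above (one target figure plus up to three profile figures, each with an image, a caption, and figure-mentioning paragraphs) can be sketched as a simple record. This is a minimal illustration only; the field names below are hypothetical and do not reflect LaMP-Cap's actual schema:

```python
from dataclasses import dataclass


@dataclass
class ProfileFigure:
    """One profile figure from the same document (hypothetical fields)."""
    image_path: str                    # path to the figure image
    caption: str                       # author-written caption
    mentioning_paragraphs: list[str]   # paragraphs that mention this figure


@dataclass
class LaMPCapInstance:
    """One target figure plus its multimodal profile (hypothetical fields)."""
    target_image_path: str
    target_mentioning_paragraphs: list[str]
    profile: list[ProfileFigure]       # up to 3 other figures, same document
    gold_caption: str                  # original author-written caption

    def __post_init__(self) -> None:
        # Per the abstract, at most three profile figures accompany a target.
        if len(self.profile) > 3:
            raise ValueError("profile may contain at most three figures")
```

A generation model would then receive the target image (and optionally its mentioning paragraphs) as input, with the profile figures serving as personalization context, and be evaluated against `gold_caption`.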