LaMP-Cap: Personalized Figure Caption Generation with Multimodal Figure Profiles

Abstract

Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose higher-quality captions more easily. Yet authors almost always need to revise generic AI-generated captions to match their own writing style and the style of their domain, which highlights the need for personalization. Despite advances in language model personalization (LaMP), existing techniques mostly focus on text-only settings and rarely address scenarios where both the inputs and the profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document (each with its image, caption, and figure-mentioning paragraphs) as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of multimodal profiles over text-only ones.
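The record structure described above (a target figure plus a profile of up to three other figures from the same document) can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual schema: the class names, the field names, and the `build_profile_context` helper are all hypothetical, and the target's other inputs are omitted because the abstract names only the figure image. The two boolean flags loosely mirror the ablation in which profile images or figure-mentioning paragraphs are withheld.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The abstract states that a profile holds "up to three other figures".
MAX_PROFILE_FIGURES = 3


@dataclass
class ProfileFigure:
    """One other figure from the same document (hypothetical field names)."""
    image_path: str
    caption: str
    mention_paragraphs: List[str] = field(default_factory=list)


@dataclass
class TargetFigure:
    """The figure to be captioned, with its multimodal profile (assumed layout)."""
    image_path: str
    profile: List[ProfileFigure] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the "up to three" limit from the abstract.
        if len(self.profile) > MAX_PROFILE_FIGURES:
            raise ValueError(
                f"profile may hold at most {MAX_PROFILE_FIGURES} figures"
            )


def build_profile_context(
    target: TargetFigure,
    use_images: bool = True,
    use_paragraphs: bool = True,
) -> List[Dict[str, object]]:
    """Select which profile elements to pass to a captioning model.

    Captions are always kept; images and figure-mentioning paragraphs
    can be dropped to imitate an ablation setting.
    """
    context: List[Dict[str, object]] = []
    for fig in target.profile:
        entry: Dict[str, object] = {"caption": fig.caption}
        if use_images:
            entry["image_path"] = fig.image_path
        if use_paragraphs:
            entry["mention_paragraphs"] = list(fig.mention_paragraphs)
        context.append(entry)
    return context
```

Calling `build_profile_context(target, use_images=False)` would yield a text-only profile, the setting that the abstract reports as less helpful than the multimodal one. How the selected elements are then turned into a model prompt is left open here, since the abstract does not specify it.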