RONA: Pragmatically Diverse Image Captioning with Coherence Relations

Abstract

Writing assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by varying the syntax and semantics used to describe image components. Human-written captions, however, use pragmatic cues to convey a central message alongside the visual description. Enhancing pragmatic diversity therefore requires exploring alternative ways of communicating that message in conjunction with the visual content. To address this challenge, we propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLMs) that leverages Coherence Relations as an axis of variation. We demonstrate that, across multiple domains, RONA generates captions with better overall diversity and ground-truth alignment than MLLM baselines. Our code is available at: https://github.com/aashish2000/RONA