

Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

July 4, 2024
作者: Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Byungsoo Ko, Jonghwan Hyeon, Ho-Jin Choi
cs.AI

Abstract

Humans share a wide variety of images related to their personal experiences within conversations via instant messaging tools. However, existing works focus on (1) image-sharing behavior in singular sessions, leading to limited long-term social interaction, and (2) a lack of personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal conversation dataset that covers a wide range of social personas, multi-modal formats, time intervals, and images. To construct Stark automatically, we propose a novel multi-modal contextualization framework, Mcu, which generates long-term multi-modal dialogue distilled from ChatGPT and our proposed Plan-and-Execute image aligner. Using Stark, we train a multi-modal conversation model, Ultron 7B, which demonstrates an impressive visual imagination ability. Furthermore, we demonstrate the effectiveness of our dataset through human evaluation. We make our source code and dataset publicly available.