Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

Abstract

Humans share a wide variety of images related to their personal experiences in conversations over instant messaging tools. However, existing work (1) focuses on image-sharing behavior within a single session, which limits long-term social interaction, and (2) lacks personalized image-sharing behavior. In this work, we introduce Stark, a large-scale long-term multi-modal conversation dataset that covers a wide range of social personas in a multi-modality format, with time intervals and images. To construct Stark automatically, we propose a novel multi-modal contextualization framework, Mcu, that generates long-term multi-modal dialogue distilled from ChatGPT and our proposed Plan-and-Execute image aligner. Using Stark, we train a multi-modal conversation model, Ultron 7B, which demonstrates impressive visual imagination ability. Furthermore, we demonstrate the effectiveness of our dataset through human evaluation. We make our source code and dataset publicly available.
