Towards Customized Multimodal Role-Play

Abstract

Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while keeping outputs consistent across modalities remains largely unexplored. To bridge this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct RoleScape-20, a dataset of 20 characters whose training and evaluation data cover persona, stylistic descriptions, visual and expressive cues, and text-image interactions. Building on a unified model, we devise UniCharacter, a two-stage training framework comprising Unified Supervised Finetuning (Unified-SFT) and character-specific group relative policy optimization (Character-GRPO). Given only 10 images and their corresponding interaction examples, the model acquires the target character and exhibits a coherent persona, style, and visual identity in both generated text and images. This customization process takes about 100 GPU hours. Experiments on RoleScape-20 show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a foundation for next-generation characterful and immersive interactive agents.
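
For readers unfamiliar with the "group relative" part of the Character-GRPO stage named in the abstract, the sketch below shows the generic group-relative advantage used in GRPO-style training: several responses are sampled for one prompt, and each reward is normalized by the mean and standard deviation of its own group. This is only an illustration under stated assumptions. The abstract does not specify the character-specific rewards, so the reward values, the function name `group_relative_advantages`, the `eps` parameter, and the use of the population standard deviation are all hypothetical choices, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other rewards in its sampled group.

    Generic GRPO-style normalization: A_i = (r_i - mean(r)) / (std(r) + eps).
    The population standard deviation is an assumption made for this sketch.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from one role-play prompt
# (e.g. some combined persona/style/identity score; values are made up).
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separately trained value model is needed to provide a baseline.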