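Towards Customized Multimodal Role-Play

Abstract

Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity, while keeping outputs consistent across modalities, remains largely unexplored. To bridge this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct RoleScape-20, a dataset of 20 characters with training and evaluation data covering persona, style descriptions, visual and expression cues, and text-image interactions. Building on a unified model, we design UniCharacter, a two-stage training framework consisting of Unified Supervised Fine-Tuning (Unified-SFT) followed by Character-specific Group Relative Policy Optimization (Character-GRPO). Given only 10 images and their corresponding interaction examples, the model acquires the target character and exhibits a consistent persona, style, and visual identity in both generated text and generated images. This customization process takes about 100 GPU hours. Experiments on RoleScape-20 show that the proposed method substantially outperforms prior approaches, and ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. We believe that CMRP, combined with unified modeling, lays a foundation for the next generation of characterful and immersive interactive agents.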

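
For readers unfamiliar with the second training stage, the following is a minimal sketch of the group-relative advantage computation on which GRPO-style methods are generally based. It is not the Character-GRPO procedure itself: the abstract does not specify the character-specific reward, so the reward values, the function name `group_relative_advantages`, and the use of the population standard deviation are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    `rewards` holds one scalar reward per sampled response to the same
    prompt. In generic GRPO, the advantage of a sample is its reward
    minus the group mean, divided by the group standard deviation.
    The population standard deviation and `eps` are assumptions here;
    implementations differ on these details.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt,
# e.g. scores for how well each response matches the target character.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate learned value function is needed to form a baseline. How the rewards themselves are defined for persona, style, and visual identity is left to the full paper.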