Towards Customized Multimodal Role-Play
May 1, 2026
Authors: Chao Tang, Jianzong Wu, Qingyu Shi, Ye Tian, Aixi Zhang, Hao Jiang, Jiangning Zhang, Yunhai Tong
cs.AI
Abstract
Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while keeping outputs consistent across modalities remains largely unexplored. To address this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP), and construct the RoleScape-20 dataset. RoleScape-20 comprises 20 characters, with training and evaluation data covering persona descriptions, style descriptions, visual and expression cues, and text-image interactions. Building on a unified model, we devise UniCharacter, a two-stage training framework consisting of Unified Supervised Finetuning (Unified-SFT) and Character-specific Group Relative Policy Optimization (Character-GRPO). Given only 10 images and the corresponding interaction examples, the model learns the target character and exhibits a coherent persona, style, and visual identity in both generated text and generated images; this customization takes about 100 GPU hours. Experiments on RoleScape-20 show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a foundation for next-generation immersive interactive agents with distinctive characters.
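The abstract does not spell out how Character-GRPO scores or weights its samples, so the following is only a minimal sketch of the group-relative advantage step used in standard GRPO, on which the second training stage is presumably built. The function name, the `eps` parameter, and the reward values are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in its group.

    Standard GRPO samples several responses for the same prompt, scores each
    one, and uses (reward - group mean) / group std as the advantage instead
    of a learned value baseline. Population std is used here; implementations
    differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive a positive advantage and those below it a negative one, so the policy update favors the better samples within each group. In a character-specific variant, the reward would presumably measure persona, style, or visual-identity consistency, but that design is not described in the abstract.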