Towards Customized Multimodal Role-Play

Abstract

Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while keeping outputs consistent across modalities remains largely unexplored. To address this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct RoleScape-20, a dataset of 20 characters with training and evaluation data covering persona, style descriptions, visual and expressive cues, and text-image interactions. Building on a unified model, we design UniCharacter, a two-stage training framework consisting of Unified Supervised Finetuning (Unified-SFT) and Character-specific Group Relative Policy Optimization (Character-GRPO). Given only 10 images and their corresponding interaction examples, the model acquires the target character and exhibits a coherent persona, style, and visual identity in both generated text and images. This process takes about 100 GPU hours. Experiments on RoleScape-20 show that the proposed method substantially outperforms prior approaches, and ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a basis for next-generation interactive agents that are both characterful and immersive.