

End-to-End Video Character Replacement without Structural Guidance

January 13, 2026
Authors: Zhengbo Xu, Jie Ma, Ziheng Wang, Zhan Peng, Jun Liang, Jing Li
cs.AI

Abstract

Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single arbitrary frame mask. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: orange-3dv-team.github.io/MoCha
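The condition-aware RoPE referenced in the abstract is not specified here. As a rough illustration only, the sketch below applies a standard rotary position embedding (RoPE) and shifts the positions of reference-identity tokens by a fixed offset so that the different input conditions occupy distinct position ranges; the function names, the offset scheme, and all parameter values are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of rotary position embedding (RoPE) with a per-condition
# position offset. The offset scheme (shifting reference-image tokens into a
# separate position range) is an illustrative assumption, not MoCha's design.
import torch


def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Return rotation angles of shape (len(positions), dim // 2)."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions.float()[:, None] * inv_freq[None, :]


def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of x (seq, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


# Hypothetical usage: video tokens keep their temporal positions, while tokens
# from the reference identity image are shifted by a fixed offset so that
# attention can distinguish the two condition types.
video_tokens = torch.randn(16, 64)      # tokens from 16 video frames, dim 64
ref_tokens = torch.randn(4, 64)         # tokens from the reference identity image
video_pos = torch.arange(16)
ref_pos = torch.arange(4) + 1024        # condition-dependent offset (assumed value)
video_out = apply_rope(video_tokens, rope_angles(video_pos, 64))
ref_out = apply_rope(ref_tokens, rope_angles(ref_pos, 64))
```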