

MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

September 24, 2024
Authors: Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo
cs.AI

Abstract
Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics community, 3D works typically require multi-view captures for per-case training, which severely limits their applicability to modeling arbitrary characters in a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle with pose generality and scene interaction. To this end, we propose MIMO, a novel framework that can not only synthesize character videos with controllable attributes (i.e., character, motion, and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video into compact spatial codes, considering the inherent 3D nature of video occurrence. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip into three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded into a canonical identity code, a structured motion code, and a full scene code, which are utilized as control signals for the synthesis process. The design of spatial decomposed modeling enables flexible user control, complex motion expression, and 3D-aware synthesis for scene interactions. Experimental results demonstrate the effectiveness and robustness of the proposed method.
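The depth-based layer decomposition described above can be sketched in a simplified form. The snippet below is a minimal illustration, not the paper's actual pipeline: it assumes a per-frame depth map from some monocular estimator and a binary segmentation mask of the main human are already available, and splits the remaining pixels into a "floating occlusion" layer (objects in front of the human) and an "underlying scene" layer. The function name, thresholding rule, and inputs are hypothetical choices for clarity.

```python
import numpy as np

def decompose_by_depth(frame, depth, human_mask):
    """Split a frame into three spatial layers by relative depth,
    loosely following the human / occlusion / scene decomposition.

    frame: (H, W, 3) RGB image
    depth: (H, W) per-pixel depth from a monocular estimator
           (smaller = closer to camera; assumed convention)
    human_mask: (H, W) boolean segmentation of the main human
    """
    # The nearest depth on the human serves as a simple reference plane.
    human_near = depth[human_mask].min()

    # Occlusion layer: non-human pixels strictly in front of the human.
    occlusion_mask = (~human_mask) & (depth < human_near)
    # Scene layer: all remaining non-human pixels.
    scene_mask = (~human_mask) & (~occlusion_mask)

    layers = {}
    for name, mask in [("human", human_mask),
                       ("occlusion", occlusion_mask),
                       ("scene", scene_mask)]:
        layer = np.zeros_like(frame)
        layer[mask] = frame[mask]  # copy only this layer's pixels
        layers[name] = layer
    return layers
```

The three masks partition the frame, so each pixel contributes to exactly one layer; the real method then encodes these layers separately (identity, motion, and scene codes) rather than keeping them as raw images.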

