Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation

Abstract

Camera and human motion control have been studied extensively for video generation, but existing approaches typically address them separately and suffer from the scarcity of data with high-quality annotations for both aspects. To overcome this, we present Uni3C, a unified 3D-enhanced framework for precise control of both camera and human motion in video generation. Uni3C makes two key contributions. First, we propose PCDController, a plug-and-play control module trained with a frozen video generative backbone, which uses point clouds unprojected from monocular depth to achieve accurate camera control. By leveraging the strong 3D priors of point clouds and the powerful capacity of video foundation models, PCDController generalizes well, performing strongly whether the inference backbone is frozen or fine-tuned. This flexibility allows the different modules of Uni3C to be trained in their own domains, i.e., either camera control or human motion control, reducing the dependency on jointly annotated data. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates scenic point clouds and SMPL-X characters to unify the control signals for camera and human motion, respectively. Extensive experiments confirm that PCDController is robust in driving camera motion for fine-tuned video generation backbones. Uni3C substantially outperforms competitors in both camera controllability and human motion quality. Additionally, we collect tailored validation sets featuring challenging camera movements and human actions to validate the effectiveness of our method.
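To make the phrase "point clouds unprojected from monocular depth" concrete, the following is a minimal NumPy sketch of the underlying geometry. It is not the authors' implementation. It assumes a pinhole camera with intrinsics matrix K and a target pose written as x_cam = R x + t, neither of which is specified in the abstract. The function names unproject_depth and project_points are hypothetical.

```python
import numpy as np

def unproject_depth(depth, K):
    """Lift an (H, W) depth map to an (H*W, 3) point cloud in camera coordinates.

    Assumes a pinhole model with a 3x3 intrinsics matrix K (an assumption,
    not something stated in the abstract).
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project each pixel to a ray via K^{-1}, then scale the ray by its depth.
    rays = pixels @ np.linalg.inv(K).T
    return rays * depth.reshape(-1, 1)

def project_points(points, K, R, t):
    """Project 3D points into a target camera with pose x_cam = R @ x + t.

    Returns (N, 2) pixel coordinates and (N,) depths in the target camera.
    """
    cam = points @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

# Toy example: a flat scene 2 units away, seen by a camera with fx = fy = 100.
K = np.array([[100.0, 0.0, 3.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
depth = np.full((4, 6), 2.0)
cloud = unproject_depth(depth, K)

# Re-project under a hypothetical new camera pose: no rotation and a shift of
# 0.1 units along x in the point transform.
uv_new, z_new = project_points(cloud, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

In this toy setup every pixel moves horizontally by fx * 0.1 / depth = 5 pixels, which illustrates why a point cloud carries an explicit 3D prior for camera motion: once depth is known, the image-space effect of a camera pose follows directly from geometry.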