GimbalDiffusion: Gravity-Aware Camera Control for Video Generation
December 9, 2025
Authors: Frédéric Fortier-Chouinard, Yannick Hold-Geoffroy, Valentin Deschaintre, Matheus Gadelha, Jean-François Lalonde
cs.AI
Abstract
Recent progress in text-to-video generation has achieved remarkable realism, yet fine-grained control over camera motion and orientation remains elusive. Existing approaches typically encode camera trajectories through relative or ambiguous representations, limiting explicit geometric control. We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. Instead of describing motion relative to previous frames, our method defines camera trajectories in an absolute coordinate system, allowing precise and interpretable control over camera parameters without requiring an initial reference frame. We leverage panoramic 360-degree videos to construct a wide variety of camera trajectories, well beyond the predominantly straight, forward-facing trajectories seen in conventional video data. To further enhance camera guidance, we introduce null-pitch conditioning, an annotation strategy that reduces the model's reliance on text content when conflicting with camera specifications (e.g., generating grass while the camera points towards the sky). Finally, we establish a benchmark for camera-aware video generation by rebalancing SpatialVID-HQ for comprehensive evaluation under wide camera pitch variation. Together, these contributions advance the controllability and robustness of text-to-video models, enabling precise, gravity-aligned camera manipulation within generative frameworks.
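The core idea of an absolute, gravity-referenced trajectory can be sketched in a few lines: each frame's camera orientation is specified directly by pitch (elevation relative to the horizon) and yaw (heading about the gravity axis), with no dependence on a previous frame. This is a minimal illustrative sketch; the parameterization, function names, and angle conventions here are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def gravity_aligned_rotation(pitch, yaw, roll=0.0):
    """Build a camera rotation from absolute angles defined against
    gravity: pitch is elevation above the horizon, yaw is the heading
    about the gravity (up) axis, roll spins about the viewing axis.
    Angles are in radians. (Hypothetical convention for illustration.)"""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    # Rotate about the up axis (yaw), then the camera's right axis
    # (pitch), then the viewing axis (roll).
    R_yaw = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    R_pitch = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    R_roll = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    return R_yaw @ R_pitch @ R_roll

def absolute_trajectory(pitches, yaws):
    """An absolute trajectory is just a per-frame list of poses; no
    initial reference frame or frame-to-frame deltas are needed, so
    each pose is interpretable on its own."""
    return [gravity_aligned_rotation(p, y) for p, y in zip(pitches, yaws)]

# A 4-frame tilt from the horizon (pitch 0) to straight up (pitch 90°),
# at constant heading -- the kind of wide-pitch motion the paper
# extracts from panoramic 360° videos.
traj = absolute_trajectory(np.linspace(0.0, np.pi / 2, 4), np.zeros(4))
```

Because every pose is expressed in the same gravity-anchored world frame, conditioning signals like "camera points at the sky" map directly to a pitch value, which is what makes conflicts with text content (e.g., grass vs. sky) detectable and addressable via the null-pitch annotation strategy.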