GimbalDiffusion: Gravity-Aware Camera Control for Video Generation
December 9, 2025
Authors: Frédéric Fortier-Chouinard, Yannick Hold-Geoffroy, Valentin Deschaintre, Matheus Gadelha, Jean-François Lalonde
cs.AI
Abstract
Recent progress in text-to-video generation has achieved remarkable realism, yet fine-grained control over camera motion and orientation remains elusive. Existing approaches typically encode camera trajectories through relative or ambiguous representations, limiting explicit geometric control. We introduce GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using gravity as a global reference. Instead of describing motion relative to previous frames, our method defines camera trajectories in an absolute coordinate system, allowing precise and interpretable control over camera parameters without requiring an initial reference frame. We leverage panoramic 360-degree videos to construct a wide variety of camera trajectories, well beyond the predominantly straight, forward-facing trajectories seen in conventional video data. To further enhance camera guidance, we introduce null-pitch conditioning, an annotation strategy that reduces the model's reliance on the text prompt when it conflicts with the camera specification (e.g., asking for grass while the camera points towards the sky). Finally, we establish a benchmark for camera-aware video generation by rebalancing SpatialVID-HQ for comprehensive evaluation under wide camera pitch variation. Together, these contributions advance the controllability and robustness of text-to-video models, enabling precise, gravity-aligned camera manipulation within generative frameworks.
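To make the absolute, gravity-aligned parameterization concrete: because a 360-degree panorama records every viewing direction, a perspective crop with any prescribed absolute yaw, pitch, and roll can be rendered from it directly, with pitch measured against the horizon (i.e., gravity) rather than against a previous frame. The sketch below is our own illustration of that standard equirectangular reprojection, not the authors' code; the NumPy/SciPy implementation, function names, and angle conventions are all assumptions.

import numpy as np
from scipy.ndimage import map_coordinates

def rotation_world_from_camera(yaw, pitch, roll):
    # World-from-camera rotation in a gravity-aligned frame (y up).
    # Angles are absolute, in radians; positive pitch tilts the view up.
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])  # yaw about the up axis
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cp, sp], [0.0, -sp, cp]])  # pitch about the right axis
    Rz = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])  # roll about the view axis
    return Ry @ Rx @ Rz

def perspective_from_pano(pano, yaw, pitch, roll, fov_deg=90.0, out_hw=(256, 256)):
    # Render a pinhole view (h, w, 3) from an equirectangular panorama (H, W, 3).
    h, w = out_hw
    f = 0.5 * w / np.tan(0.5 * np.radians(fov_deg))    # focal length in pixels
    u, v = np.meshgrid(np.arange(w), np.arange(h))     # output pixel grid
    x = (u + 0.5 - 0.5 * w) / f                        # camera x: right
    y = -(v + 0.5 - 0.5 * h) / f                       # camera y: up (image rows grow downward)
    rays = np.stack([x, y, np.ones_like(x)], axis=-1)  # camera z: forward
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    rays = rays @ rotation_world_from_camera(yaw, pitch, roll).T  # rotate rays into the world frame
    lon = np.arctan2(rays[..., 0], rays[..., 2])                  # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))             # latitude in [-pi/2, pi/2]
    ph, pw = pano.shape[:2]
    px = (lon / (2.0 * np.pi) + 0.5) * pw - 0.5        # panorama column index
    py = (0.5 - lat / np.pi) * ph - 0.5                # panorama row index (zenith at row 0)
    return np.stack([map_coordinates(pano[..., c].astype(np.float64),
                                     [py, px], order=1, mode='wrap')
                     for c in range(pano.shape[2])], axis=-1)

Under these assumed conventions, perspective_from_pano(pano, yaw=0.0, pitch=np.radians(60), roll=0.0) renders a view 60 degrees above the horizon from a single panoramic frame, exactly the kind of off-horizon pose that is rare in conventional forward-facing video.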