UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
April 19, 2026
Authors: Hong Jiang, Wensong Song, Zongxing Yang, Ruijie Quan, Yi Yang
cs.AI
Abstract
Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods have two limitations. First, they typically rely on fragmented geometric guidance: for example, they inject point clouds only at the representation level, even though a model comprises multiple levels. Second, they are mainly built on image diffusion models, which operate on discrete view mappings. Together, these limitations lead to geometric drift and structural degradation under continuous camera motion.
We observe that although video models provide continuous viewpoint priors for camera-controllable image editing, they still struggle to form a stable geometric understanding when geometric guidance remains fragmented. To address this systematically, we inject unified geometric guidance at the three levels that jointly determine the generative output: representation, architecture, and loss function.
To this end, we propose UniGeo, a novel camera-controllable editing framework. At the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometric context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss-function level, it adopts a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views.
Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.