SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
May 7, 2026
Authors: YaoYang Liu, Yuechen Zhang, Wenbo Li, Yufei Zhao, Rui Liu, Long Chen
cs.AI
Abstract
High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving the fine-grained appearance details of the input image. At 2K resolution the task becomes extremely challenging, and existing solutions fall short in two ways: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution model tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency-fidelity dilemma by first generating a low-resolution motion reference, which reduces token costs and eases the modeling burden, and then performing a strongly image-conditioned 2K synthesis guided by that motion reference to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG), which synthesizes videos segment by segment under a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU time by 202×. Notably, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
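The bounded per-step token budget behind segment-wise generation can be illustrated with some simple bookkeeping. The sketch below is not the authors' implementation: the names `LatentGrid`, `split_into_segments`, and `per_step_tokens` are hypothetical, and the latent shape (21 latent frames on a 90 × 160 token grid) as well as the single image-condition frame are assumptions chosen only to make the arithmetic concrete. It shows only how splitting a video into segments caps the number of tokens processed in one step, not how CSG conditions or denoises each segment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentGrid:
    """Hypothetical latent-video shape, measured in tokens per axis."""
    frames: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Total tokens if the whole video is processed in one pass.
        return self.frames * self.height * self.width


def split_into_segments(total_frames: int, segment_frames: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) latent-frame ranges that cover the video."""
    if total_frames <= 0 or segment_frames <= 0:
        raise ValueError("frame counts must be positive")
    return [
        (start, min(start + segment_frames, total_frames))
        for start in range(0, total_frames, segment_frames)
    ]


def per_step_tokens(grid: LatentGrid, segment_frames: int, condition_frames: int = 1) -> int:
    """Upper bound on tokens in one step: the largest segment plus condition frames.

    The condition-frame count is an assumption standing in for the
    input-image tokens that each segment would attend to.
    """
    largest_segment = min(segment_frames, grid.frames)
    return (largest_segment + condition_frames) * grid.height * grid.width


if __name__ == "__main__":
    # Assumed shape: 21 latent frames on a 90 x 160 token grid.
    grid = LatentGrid(frames=21, height=90, width=160)
    print("segments:", split_into_segments(grid.frames, segment_frames=4))
    print("full-video tokens:", grid.tokens)
    print("per-step tokens:", per_step_tokens(grid, segment_frames=4))
```

Under these assumed shapes, the per-step token count depends on the segment length rather than the video length, so a longer video adds segments without raising the per-step bound.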