SemanticGen: Video Generation in Semantic Space
December 23, 2025
Authors: Jianhong Bai, Xiaoshi Wu, Xintao Wang, Fu Xiao, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai
cs.AI
Abstract
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
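The two-stage pipeline described above can be sketched in miniature. The snippet below is a conceptual illustration only, not the paper's implementation: `toy_denoiser`, `sample`, and all dimensions are hypothetical stand-ins, and the learned diffusion denoisers and VAE decoder are replaced by trivial placeholders. It shows the structural idea: stage one samples in a small semantic space, stage two samples the much larger VAE latents conditioned on those semantic features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes; the real model's dimensions are not given in the abstract.
SEMANTIC_TOKENS, SEMANTIC_DIM = 16, 64   # compact, high-level semantic space
LATENT_TOKENS, LATENT_DIM = 256, 16      # much larger VAE latent space

def toy_denoiser(x, cond=None):
    """Stand-in for a learned diffusion denoiser: one deterministic step."""
    x = 0.9 * x
    if cond is not None:
        # Inject conditioning by adding a pooled summary of the semantic
        # features to every latent token (illustrative only).
        x = x + 0.1 * cond.mean()
    return x

def sample(shape, cond=None, steps=10):
    """Run a toy reverse-diffusion loop starting from Gaussian noise."""
    x = rng.standard_normal(shape)
    for _ in range(steps):
        x = toy_denoiser(x, cond)
    return x

# Stage 1: generate compact semantic features defining the video's global layout.
semantic_features = sample((SEMANTIC_TOKENS, SEMANTIC_DIM))

# Stage 2: generate VAE latents conditioned on the semantic features;
# a VAE decoder (not shown) would then map these latents to pixels.
vae_latents = sample((LATENT_TOKENS, LATENT_DIM), cond=semantic_features)

print(vae_latents.shape)
```

Note how the stage-one sample has far fewer tokens than the stage-two latents, which is where the abstract's claimed efficiency for long videos comes from: global planning happens over a compact representation before high-frequency detail is added.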