SemanticGen: Video Generation in Semantic Space
December 23, 2025
Authors: Jianhong Bai, Xiaoshi Wu, Xintao Wang, Fu Xiao, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai
cs.AI
Abstract
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
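To make the two-stage pipeline concrete, the following is a minimal sketch of the cascade described in the abstract. It assumes a rectified-flow-style Euler sampler and small placeholder denoiser networks; the class names (SemanticDenoiser, LatentDenoiser), token counts, feature dimensions, step counts, and conditioning scheme are illustrative assumptions rather than the authors' implementation, and text conditioning and the VAE decoder are omitted.

```python
# Illustrative sketch of a two-stage semantic-then-latent cascade (not the authors' code).
# Stage 1 samples compact semantic features; Stage 2 samples VAE latents conditioned on them.
import torch
from torch import nn


class SemanticDenoiser(nn.Module):
    """Stage-1 model: predicts a velocity field over compact semantic video tokens."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 1024), nn.SiLU(), nn.Linear(1024, dim))

    def forward(self, x, t):
        # x: (B, N_sem, dim) semantic tokens; t: (B,) diffusion time in [0, 1]
        t_emb = t[:, None, None].expand(x.shape[0], x.shape[1], 1)
        return self.net(torch.cat([x, t_emb], dim=-1))


class LatentDenoiser(nn.Module):
    """Stage-2 model: predicts a velocity field over VAE latents, conditioned on semantics."""
    def __init__(self, dim=16, sem_dim=512):
        super().__init__()
        self.cond = nn.Linear(sem_dim, dim)
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 1024), nn.SiLU(), nn.Linear(1024, dim))

    def forward(self, x, t, sem):
        # x: (B, N_lat, dim) VAE latent tokens; sem: (B, N_sem, sem_dim) semantic plan
        c = self.cond(sem).mean(dim=1, keepdim=True).expand(-1, x.shape[1], -1)
        t_emb = t[:, None, None].expand(x.shape[0], x.shape[1], 1)
        return self.net(torch.cat([x, c, t_emb], dim=-1))


@torch.no_grad()
def euler_sample(model, shape, steps=50, **cond):
    """Integrate the learned velocity field from noise (t=1) back to data (t=0)."""
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt)
        v = model(x, t, **cond)
        x = x - dt * v  # one Euler step toward the data distribution
    return x


# Stage 1: plan the video globally in the compact semantic space (few tokens).
sem_model, lat_model = SemanticDenoiser(), LatentDenoiser()
semantics = euler_sample(sem_model, shape=(1, 64, 512))

# Stage 2: generate the much larger set of VAE latents conditioned on the semantic plan;
# a pretrained VAE decoder would then map these latents to pixels.
vae_latents = euler_sample(lat_model, shape=(1, 4096, 16), sem=semantics)
```

The intended benefit of the cascade, as the abstract argues, is that global planning happens over a small number of semantic tokens, so full bidirectional attention remains cheap at that stage, while the second stage only adds high-frequency detail on top of an already-fixed layout.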