
Generative Neural Video Compression via Video Diffusion Prior

December 4, 2025
Authors: Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma
cs.AI

Abstract

We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
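The refinement step described above, which starts the flow-matching denoiser from the decoded spatio-temporal latents rather than Gaussian noise, predicts a correction term, and injects compression-aware cues into intermediate DiT layers, can be sketched as follows. This is a minimal toy illustration in PyTorch; every module name, shape, and hyper-parameter here (ToyVideoDiT, ConditioningAdaptor, refine, the number of Euler steps) is an assumption for exposition, not the paper's actual implementation.

```python
# Toy sketch of decoded-latent refinement via flow matching.
# All names and shapes are illustrative assumptions, not GNVC-VD's code.
import torch
import torch.nn as nn

class ConditioningAdaptor(nn.Module):
    """Stand-in for the adaptor that injects compression-aware cues
    (e.g., features derived from the decoded latents) into a DiT layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, cue: torch.Tensor) -> torch.Tensor:
        return self.proj(cue)

class ToyVideoDiT(nn.Module):
    """Stand-in for a video diffusion transformer over a flattened
    sequence of spatio-temporal latent tokens (batch, tokens, dim)."""
    def __init__(self, dim: int, depth: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )
        self.adaptors = nn.ModuleList(
            ConditioningAdaptor(dim) for _ in range(depth)
        )
        self.head = nn.Linear(dim, dim)

    def forward(self, z: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
        for block, adaptor in zip(self.blocks, self.adaptors):
            z = block(z + adaptor(cue))  # per-layer compression-aware cue
        return self.head(z)              # predicted correction (velocity)

def refine(model: ToyVideoDiT, z_decoded: torch.Tensor,
           cue: torch.Tensor, steps: int = 4) -> torch.Tensor:
    """Refinement is initialized from the *decoded* latents, not noise;
    a few Euler steps apply the learned correction toward clean latents."""
    z = z_decoded
    for _ in range(steps):
        v = model(z, cue)    # flow-matching velocity for this step
        z = z + v / steps    # simple Euler update
    return z

if __name__ == "__main__":
    B, T, D = 1, 16, 64                # batch, latent tokens, latent dim
    model = ToyVideoDiT(D)
    z_decoded = torch.randn(B, T, D)   # degraded latents from the codec
    cue = torch.randn(B, T, D)         # compression-aware conditioning cue
    print(refine(model, z_decoded, cue).shape)  # torch.Size([1, 16, 64])
```

The key design choice mirrored here is sequence-level denoising: the transformer attends across all frames' latent tokens jointly, which is what lets the refinement suppress frame-to-frame flicker that per-frame image priors cannot see.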