Hierarchical Codec Diffusion for Video-to-Speech Generation

April 17, 2026
Authors: Jiaxin Ye, Gaoxiang Cong, Chenhui Wang, Xin-Cheng Wen, Zhaoyang Li, Boyuan Cao, Hongming Shan
cs.AI

Abstract

Video-to-Speech (VTS) generation aims to synthesize speech from silent video, with no audio input. However, existing VTS methods disregard the hierarchical nature of speech, which ranges from coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the hierarchical structure of Residual Vector Quantization (RVQ)-based codecs, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Specifically, since lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT employs separate low-level and high-level blocks to generate the tokens at each level. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale adaptive instance layer normalization that jointly captures global vocal style through channel-wise normalization and local prosody dynamics through temporal normalization. Extensive experiments demonstrate that HiCoDiT outperforms baselines in fidelity and expressiveness, highlighting the potential of discrete modeling for VTS. The code and speech demo are available at https://github.com/Jiaxin-Ye/HiCoDiT.
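
To make the coarse-to-fine token hierarchy concrete, the following is a minimal sketch of generic residual vector quantization, not the codec or the model used in HiCoDiT. The function names (`make_codebook`, `rvq_encode`, `rvq_decode`), the 2-D input, and the hand-built grid codebooks are all assumptions chosen for illustration; a real codec learns its codebooks from speech data. Each level quantizes the residual left by the previous one, so the first tokens carry the coarse structure and later tokens add progressively finer detail.

```python
def make_codebook(scale):
    """Toy 2-D codebook (hypothetical): a 3x3 grid with the given spacing.

    It contains the zero vector, so adding a level can never increase the
    reconstruction error.
    """
    steps = (-scale, 0.0, scale)
    return [(x, y) for x in steps for y in steps]


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize level by level; each level encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks, num_levels=None):
    """Sum the selected entries of the first num_levels levels (default: all)."""
    levels = len(tokens) if num_levels is None else num_levels
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens[:levels], codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Four levels with halving grid spacing: coarse first, fine last.
codebooks = [make_codebook(0.5 ** level) for level in range(4)]
target = [0.8, -0.3]
tokens = rvq_encode(target, codebooks)

# Reconstruction error when decoding with only the first k levels.
errors = [
    squared_error(target, rvq_decode(tokens, codebooks, k))
    for k in range(len(codebooks) + 1)
]
print(tokens, errors)
```

Decoding with only the first level gives a rough approximation of the input, and each additional level refines it. This is the property that the abstract relies on when it assigns lower-level tokens to low-level blocks and higher-level tokens to high-level blocks.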