

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

May 28, 2024
Authors: Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji
cs.AI

Abstract

Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitions between high-quality 1-step sound generation and superior-quality multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM's training framework and introduce a novel feature distance by utilizing the teacher's network for a distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without using any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM's capability of controllable sound generation in a training-free manner.
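
To make two points in the abstract concrete (interpolating between the conditional and unconditional student models at inference, and switching between 1-step and multi-step sampling), here is a minimal, hypothetical sketch in PyTorch. The function names, signatures, noise schedule, and guidance formula are illustrative assumptions, not SoundCTM's published implementation.

```python
import torch


def guided_output(student_cond, student_uncond, x_t, t, s, text_emb, guidance_scale=3.5):
    # Conditional and unconditional student predictions for the jump from time t to time s.
    g_cond = student_cond(x_t, t, s, text_emb)
    g_uncond = student_uncond(x_t, t, s)
    # Classifier-free-guidance-style interpolation between the two student outputs.
    return g_uncond + guidance_scale * (g_cond - g_uncond)


def sample(student_cond, student_uncond, text_emb, shape, steps=1,
           sigma_max=80.0, guidance_scale=3.5):
    # Start from pure Gaussian noise at the largest noise level.
    x = sigma_max * torch.randn(shape)
    # Discretize the trajectory: steps=1 gives one-step generation,
    # larger values trade inference speed for sample quality.
    times = torch.linspace(sigma_max, 0.0, steps + 1)
    for t, s in zip(times[:-1], times[1:]):
        # Jump directly from time t to time s with the distilled trajectory model.
        x = guided_output(student_cond, student_uncond, x, t, s, text_emb, guidance_scale)
    return x
```

With this structure, a creator could first call the sampler with steps=1 for fast previews and then rerun it with a larger step count on the chosen prompt, which mirrors the 1-step-then-refine workflow described above.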

