SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving

Abstract

Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves a 1.8-3.0× latency reduction with a cache of only ~1K entries while preserving or improving perceptual quality.
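
To make the selection and skipping steps concrete, the following is a minimal Python sketch of how a reference selector and skip gater could fit together. It is an illustration under stated assumptions, not the SoundWeaver implementation: the abstract does not give the cache schema, the similarity measure, the gating thresholds, or the skip rule, so CacheEntry, cosine similarity, min_sim, max_dur_ratio, max_skip, and the linear skip rule are all hypothetical choices made here.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class CacheEntry:
    # Hypothetical cache record; the abstract does not specify a schema.
    embedding: Sequence[float]  # embedding of the cached prompt
    duration_s: float           # length of the cached audio, in seconds
    audio_id: str               # handle to the stored audio or latent


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, assumed here as the semantic similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_reference(
    query_emb: Sequence[float],
    query_dur_s: float,
    cache: List[CacheEntry],
    min_sim: float = 0.7,        # assumed semantic gate
    max_dur_ratio: float = 1.5,  # assumed duration gate
) -> Tuple[Optional[CacheEntry], float]:
    """Return the most similar entry passing both gates, or (None, 0.0)."""
    best, best_sim = None, 0.0
    for entry in cache:
        shorter = min(entry.duration_s, query_dur_s)
        longer = max(entry.duration_s, query_dur_s)
        if shorter <= 0 or longer / shorter > max_dur_ratio:
            continue  # durations too different to align
        sim = cosine(query_emb, entry.embedding)
        if sim >= min_sim and sim > best_sim:
            best, best_sim = entry, sim
    return best, best_sim


def skip_fraction(sim: float, min_sim: float = 0.7, max_skip: float = 0.6) -> float:
    """Assumed rule: skip nothing below the gate, then scale linearly to max_skip."""
    if sim < min_sim:
        return 0.0
    return max_skip * (sim - min_sim) / (1.0 - min_sim)


def remaining_steps(total_nfe: int, frac: float) -> int:
    """Number of denoising steps still to run after skipping a fraction."""
    return total_nfe - round(total_nfe * frac)
```

With these placeholder values, an exact semantic match on a 50-NFE schedule would skip 60% of the steps and leave 20 to run; a request with no cached entry passing both gates falls back to the full schedule.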