SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving

Abstract

Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), which incurs multi-second latency and limits throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves cached candidates and temporally aligns them via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver reduces latency by 1.8x to 3.0x with a cache of only about 1K entries while preserving or improving perceptual quality.
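The abstract does not specify how the gating or skip decision is computed, so the following is only a minimal sketch of the general idea, not SoundWeaver's actual method. It assumes that prompts are compared by cosine similarity of text embeddings, that a candidate is rejected when its duration differs too much from the requested one, and that the skipped share of NFEs grows linearly with similarity. The names `CacheEntry`, `select_reference`, `skip_fraction`, and `remaining_nfes`, and the thresholds `min_sim`, `max_dur_ratio`, and `max_skip`, are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class CacheEntry:
    prompt: str
    embedding: Sequence[float]  # text embedding of the cached prompt (assumed)
    duration_s: float           # length of the cached audio in seconds, > 0


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_reference(
    query_emb: Sequence[float],
    target_s: float,
    cache: List[CacheEntry],
    min_sim: float = 0.7,        # hypothetical semantic gate
    max_dur_ratio: float = 1.5,  # hypothetical duration gate
) -> Tuple[Optional[CacheEntry], float]:
    """Return the most similar entry that passes both gates, or (None, 0.0)."""
    best, best_sim = None, min_sim
    for entry in cache:
        longer = max(entry.duration_s, target_s)
        shorter = min(entry.duration_s, target_s)
        if longer / shorter > max_dur_ratio:
            continue  # duration gate: too different in length to align
        sim = cosine(query_emb, entry.embedding)
        if sim >= best_sim:
            best, best_sim = entry, sim
    return best, (best_sim if best is not None else 0.0)


def skip_fraction(sim: float, min_sim: float = 0.7, max_skip: float = 0.6) -> float:
    """Assumed linear map from similarity to the share of NFEs to skip."""
    if sim < min_sim:
        return 0.0
    return max_skip * (sim - min_sim) / (1.0 - min_sim)


def remaining_nfes(total_nfes: int, sim: float) -> int:
    """NFEs still to run after warm-starting from a reference with this similarity."""
    return total_nfes - round(total_nfes * skip_fraction(sim))
```

Under these assumptions, a request with no acceptable reference falls back to the full NFE budget, while a close match runs only the tail of the schedule. If NFEs dominate latency, skipping a fraction s of them gives an idealized speedup of roughly 1/(1 - s).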
