DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Abstract

Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by trade-offs among speed, quality, and control design. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (DITTO-2), a new method that speeds up inference-time optimization-based control and unlocks faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model, with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find that our method not only speeds up generation by over 10-20x, but also improves control adherence and generation quality at the same time. Furthermore, we apply our approach to a new application, maximizing text adherence (CLAP score), and show that we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
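The optimize-then-decode structure of steps (2) and (3) can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and not the DITTO-2 implementation: `one_step_sample` and `multi_step_sample` are hypothetical stand-ins for a distilled model's samplers, `intensity` is a made-up control feature, and central finite differences stand in for backpropagation through the sampler. The sketch only shows the flow: optimize the noise latent against a control loss measured on the cheap one-step sample, then reuse the optimized latent for multi-step decoding.

```python
import math

def one_step_sample(x_T):
    # Hypothetical stand-in for a distilled model's one-step sampler.
    return [math.tanh(0.8 * v) for v in x_T]

def multi_step_sample(x_T, steps=4):
    # Hypothetical stand-in for multi-step decoding: starts from the latent
    # and moves halfway toward the one-step estimate at every step.
    estimate = one_step_sample(x_T)
    x = list(x_T)
    for _ in range(steps):
        x = [xi + 0.5 * (ei - xi) for xi, ei in zip(x, estimate)]
    return x

def intensity(x0):
    # Toy control feature: mean energy of the generated signal.
    return sum(v * v for v in x0) / len(x0)

def control_loss(x_T, target):
    # Surrogate objective: the loss is measured on the one-step sample.
    return (intensity(one_step_sample(x_T)) - target) ** 2

def optimize_latent(x_T, target, lr=2.0, iters=200, eps=1e-4):
    # Gradient descent on the noise latent. Central finite differences
    # stand in for backpropagation through the sampler.
    x = list(x_T)
    for _ in range(iters):
        grad = []
        for i in range(len(x)):
            up = x[:i] + [x[i] + eps] + x[i + 1:]
            down = x[:i] + [x[i] - eps] + x[i + 1:]
            diff = control_loss(up, target) - control_loss(down, target)
            grad.append(diff / (2 * eps))
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

# Fixed toy latent (not random, so the run is reproducible).
x_init = [0.5, -1.0, 1.5, -0.3, 0.8, -1.2, 0.1, 2.0]
target = 0.2

x_opt = optimize_latent(x_init, target)  # step (2): one-step surrogate
print("decoded intensity before:", intensity(multi_step_sample(x_init)))
print("decoded intensity after: ", intensity(multi_step_sample(x_opt)))  # step (3)
```

In this toy setting the multi-step output is not identical to the one-step surrogate, so the decoded intensity lands near the target but not exactly on it. It is a rough picture of why a cheap surrogate can be enough to steer the latent before a more expensive final decode.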
