WavJourney: Compositional Audio Creation with Large Language Models

Abstract

Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.
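
The script-to-program pipeline described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation. The script schema (`layout`, `type`, `seconds`), the function names `text_to_speech`, `text_to_music`, and `text_to_audio`, and the constant-level placeholder "audio" are all assumptions made for illustration; a real system would call actual generation models, and the script would come from an LLM rather than being hard-coded.

```python
import json

SAMPLE_RATE = 8000  # assumed low rate so the toy example runs quickly

# Hypothetical audio script (schema is an assumption, not the paper's format):
# foreground items play in sequence, background items are mixed underneath.
AUDIO_SCRIPT = json.loads("""
[
  {"layout": "foreground", "type": "speech", "text": "Welcome aboard.", "seconds": 2},
  {"layout": "foreground", "type": "sound_effect", "desc": "airlock hiss", "seconds": 1},
  {"layout": "background", "type": "music", "desc": "ambient synth pad", "seconds": 3}
]
""")


def _placeholder(seconds, level):
    """Stand-in for a generation model: a constant-level mono signal."""
    return [level] * int(seconds * SAMPLE_RATE)


# Hypothetical task-specific generators (placeholders, not real models).
def text_to_speech(text, seconds):
    return _placeholder(seconds, 0.5)


def text_to_music(desc, seconds):
    return _placeholder(seconds, 0.1)


def text_to_audio(desc, seconds):
    return _placeholder(seconds, 0.3)


def concatenate(clips):
    """Play clips one after another."""
    out = []
    for clip in clips:
        out.extend(clip)
    return out


def mix(clips):
    """Overlay clips sample by sample; shorter clips are treated as silent afterwards."""
    length = max(len(c) for c in clips)
    return [sum(c[i] for c in clips if i < len(c)) for i in range(length)]


# Maps a script item type to (function name, key holding its text argument).
GENERATORS = {
    "speech": ("text_to_speech", "text"),
    "music": ("text_to_music", "desc"),
    "sound_effect": ("text_to_audio", "desc"),
}


def compile_script(script):
    """Turn the script into program lines, one model or operation call per line."""
    lines, foreground, background = [], [], []
    for i, item in enumerate(script):
        func, key = GENERATORS[item["type"]]
        var = f"clip_{i}"
        lines.append(f"{var} = {func}({item[key]!r}, {item['seconds']})")
        (foreground if item["layout"] == "foreground" else background).append(var)
    lines.append(f"foreground = concatenate([{', '.join(foreground)}])")
    tracks = ["foreground"] + background
    lines.append(f"result = mix([{', '.join(tracks)}])")
    return lines


def run_program(lines):
    """Execute the compiled lines in a namespace exposing only the audio functions."""
    env = {
        "text_to_speech": text_to_speech,
        "text_to_music": text_to_music,
        "text_to_audio": text_to_audio,
        "concatenate": concatenate,
        "mix": mix,
    }
    exec("\n".join(lines), env)
    return env["result"]


program = compile_script(AUDIO_SCRIPT)
audio = run_program(program)
print("\n".join(program))
print(f"{len(audio) / SAMPLE_RATE:.1f} seconds of audio")
```

Because the compiled program is plain text with one call per line, a person can read or edit it before it runs, which is the kind of interpretability and human involvement the abstract attributes to the script-based design.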
