WavJourney: Compositional Audio Creation with Large Language Models

Abstract

Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.
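The script-to-program step described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not WavJourney's actual script format or compiler: the script schema (`kind`, `layout`, `desc`, `seconds`), the function names, and the three-field program steps are all hypothetical, and `generate_stub` returns sine tones in place of real speech, music, or sound-effect models.

```python
import math

SAMPLE_RATE = 8000  # low rate keeps the toy example fast


def generate_stub(kind, desc, seconds):
    """Hypothetical stand-in for a task-specific audio model.

    Returns a sine tone whose pitch depends on the kind; desc is ignored.
    """
    freq = {"speech": 220.0, "music": 440.0, "sound_effect": 880.0}[kind]
    n = int(seconds * SAMPLE_RATE)
    return [0.3 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


def concatenate(clips):
    """Join waveforms end to end."""
    out = []
    for clip in clips:
        out.extend(clip)
    return out


def mix(a, b):
    """Sum two waveforms sample by sample, padding the shorter with silence."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


def compile_script(script):
    """Turn script entries into a flat list of (target, op, args) steps."""
    program, foreground, background = [], [], []
    for i, item in enumerate(script):
        name = f"clip{i}"
        program.append((name, "generate", (item["kind"], item["desc"], item["seconds"])))
        (foreground if item["layout"] == "foreground" else background).append(name)
    program.append(("fg", "concatenate", tuple(foreground)))
    program.append(("bg", "concatenate", tuple(background)))
    program.append(("out", "mix", ("fg", "bg")))
    return program


def run(program):
    """Execute the steps in order, one model call or operation per step."""
    env = {}
    for target, op, args in program:
        if op == "generate":
            env[target] = generate_stub(*args)
        elif op == "concatenate":
            env[target] = concatenate([env[name] for name in args])
        elif op == "mix":
            env[target] = mix(env[args[0]], env[args[1]])
        else:
            raise ValueError(f"unknown operation: {op}")
    return env["out"]


# A hand-written script; in the described system an LLM would produce it.
script = [
    {"kind": "music", "layout": "background", "desc": "calm synth pad", "seconds": 3.0},
    {"kind": "speech", "layout": "foreground", "desc": "Welcome aboard.", "seconds": 1.0},
    {"kind": "sound_effect", "layout": "foreground", "desc": "airlock hiss", "seconds": 1.5},
]
program = compile_script(script)
waveform = run(program)
```

Because the program is a plain list of steps, each line can be inspected or edited before execution, which is the kind of interpretability the abstract attributes to the compiled program.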
