WavJourney: Compositional Audio Creation with Large Language Models
July 26, 2023
Authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
cs.AI
Abstract
Large Language Models (LLMs) have shown great promise in integrating diverse
expert models to tackle intricate language and vision tasks. Despite their
significance in advancing the field of Artificial Intelligence Generated
Content (AIGC), their potential in intelligent audio content creation remains
unexplored. In this work, we tackle the problem of creating audio content with
storylines encompassing speech, music, and sound effects, guided by text
instructions. We present WavJourney, a system that leverages LLMs to connect
various audio models for audio content generation. Given a text description of
an auditory scene, WavJourney first prompts LLMs to generate a structured
script dedicated to audio storytelling. The audio script incorporates diverse
audio elements, organized based on their spatio-temporal relationships. As a
conceptual representation of audio, the audio script provides an interactive
and interpretable rationale for human engagement. Afterward, the audio script
is fed into a script compiler, converting it into a computer program. Each line
of the program calls a task-specific audio generation model or computational
operation function (e.g., concatenate, mix). The computer program is then
executed to obtain an explainable solution for audio generation. We demonstrate
the practicality of WavJourney across diverse real-world scenarios, including
science fiction, education, and radio play. The explainable and interactive
design of WavJourney fosters human-machine co-creation in multi-round
dialogues, enhancing creative control and adaptability in audio production.
WavJourney audiolizes the human imagination, opening up new avenues for
creativity in multimedia content creation.
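
The script-to-program step described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not WavJourney's actual script format, compiler, or API: the script fields (`type`, `text`, `seconds`, `layout`), the `MODELS` registry, and the stub generators that return constant signals instead of calling real speech, music, or sound-effect models. It only shows the general idea that each program line either calls a task-specific generator or applies an operation such as concatenate or mix.

```python
from typing import Callable, Dict, List

Audio = List[float]  # a mono waveform as a plain list of samples
SAMPLE_RATE = 4      # tiny rate so the toy waveforms stay readable


def stub_model(level: float) -> Callable[[str, float], Audio]:
    """Return a stand-in 'model' that emits a constant signal, not real audio."""
    def generate(text: str, seconds: float) -> Audio:
        return [level] * int(seconds * SAMPLE_RATE)
    return generate


# Hypothetical registry: one stand-in per audio type named in the abstract.
MODELS: Dict[str, Callable[[str, float], Audio]] = {
    "speech": stub_model(0.5),
    "music": stub_model(0.2),
    "sound_effect": stub_model(0.1),
}


def concatenate(clips: List[Audio]) -> Audio:
    """Play clips one after another."""
    return [sample for clip in clips for sample in clip]


def mix(tracks: List[Audio]) -> Audio:
    """Overlay tracks sample by sample; shorter tracks are padded with silence."""
    length = max((len(t) for t in tracks), default=0)
    return [sum(t[i] for t in tracks if i < len(t)) for i in range(length)]


def compile_script(script: List[dict]) -> List[str]:
    """Emit one generator call per script element, then the combining operations."""
    lines, foreground, background = [], [], []
    for i, item in enumerate(script):
        var = f"clip_{i}"
        lines.append(
            f"{var} = MODELS[{item['type']!r}]({item['text']!r}, {item['seconds']!r})"
        )
        (foreground if item["layout"] == "foreground" else background).append(var)
    lines.append(f"fg = concatenate([{', '.join(foreground)}])")
    lines.append(
        f"result = mix([fg, {', '.join(background)}])" if background else "result = fg"
    )
    return lines


def run(program: List[str]) -> Audio:
    """Execute the compiled lines. exec is acceptable only because this is a toy."""
    env = {"MODELS": MODELS, "concatenate": concatenate, "mix": mix}
    exec("\n".join(program), env)
    return env["result"]


# Hypothetical audio script: foreground items play in sequence over the background.
script = [
    {"type": "speech", "text": "Welcome aboard.", "seconds": 1.0, "layout": "foreground"},
    {"type": "music", "text": "calm ambient pad", "seconds": 2.0, "layout": "background"},
    {"type": "sound_effect", "text": "airlock hiss", "seconds": 0.5, "layout": "foreground"},
]
program = compile_script(script)
result = run(program)
print("\n".join(program))
```

Because the compiled program is ordinary readable text, a person could inspect or edit individual lines before execution, which is one plausible reading of the interpretability the abstract claims.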