WavJourney: Compositional Audio Creation with Large Language Models

July 26, 2023
Authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
cs.AI

Abstract

Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized based on their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. Afterward, the audio script is fed into a script compiler, converting it into a computer program. Each line of the program calls a task-specific audio generation model or computational operation function (e.g., concatenate, mix). The computer program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio play. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. WavJourney audiolizes the human imagination, opening up new avenues for creativity in multimedia content creation.
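
The script-to-program step described in the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the script schema (`type`, `text`, `seconds`, `layout`), the `foreground`/`background` layout rule, and the stub `generate` function, which stands in for real speech, music, and sound-effect models by returning constant-valued toy waveforms. Only the overall shape follows the abstract: a structured script is compiled into a program whose lines each call a generation model or an operation such as concatenate or mix, and the program is then executed.

```python
SAMPLE_RATE = 8  # tiny hypothetical rate so the toy waveforms stay short


def generate(kind, text, seconds):
    """Stub for a task-specific audio model (hypothetical, not a real API)."""
    level = {"speech": 0.5, "music": 0.2, "sound_effect": 0.3}[kind]
    return [level] * (seconds * SAMPLE_RATE)


def concatenate(*clips):
    """Play clips one after another."""
    out = []
    for clip in clips:
        out.extend(clip)
    return out


def mix(*clips):
    """Overlay clips sample by sample; shorter clips are treated as silent past their end."""
    length = max((len(clip) for clip in clips), default=0)
    return [sum(clip[i] for clip in clips if i < len(clip)) for i in range(length)]


def compile_script(script):
    """Turn a structured script into a list of (target, operation, args) lines.

    Assumed layout rule: foreground elements play in sequence,
    background elements are mixed underneath the whole sequence.
    """
    program, foreground, background = [], [], []
    for index, element in enumerate(script):
        name = f"clip{index}"
        args = (element["type"], element["text"], element["seconds"])
        program.append((name, "generate", args))
        if element["layout"] == "foreground":
            foreground.append(name)
        else:
            background.append(name)
    program.append(("fg", "concatenate", tuple(foreground)))
    program.append(("out", "mix", ("fg", *background)))
    return program


def run(program):
    """Execute the program line by line and return the final waveform."""
    operations = {"concatenate": concatenate, "mix": mix}
    env = {}
    for target, operation, args in program:
        if operation == "generate":
            env[target] = generate(*args)
        else:
            env[target] = operations[operation](*(env[name] for name in args))
    return env["out"]


# A hypothetical script, as an LLM might produce it from a scene description.
script = [
    {"type": "music", "text": "calm ambient pad", "seconds": 3, "layout": "background"},
    {"type": "speech", "text": "Welcome aboard.", "seconds": 1, "layout": "foreground"},
    {"type": "sound_effect", "text": "door hiss", "seconds": 1, "layout": "foreground"},
]

program = compile_script(script)
audio = run(program)
for line in program:
    print(line)
```

Because the compiled program is plain data, each line can be inspected or edited before execution, which is one way to picture the interpretability and human-in-the-loop control the abstract refers to.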