WavJourney: Compositional Audio Creation with Large Language Models

Abstract

Large Language Models (LLMs) have shown great promise in integrating diverse expert models to tackle intricate language and vision tasks. Despite their significance in advancing the field of Artificial Intelligence Generated Content (AIGC), their potential in intelligent audio content creation remains unexplored. In this work, we tackle the problem of creating audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. We present WavJourney, a system that leverages LLMs to connect various audio models for audio content generation. Given a text description of an auditory scene, WavJourney first prompts LLMs to generate a structured script dedicated to audio storytelling. The audio script incorporates diverse audio elements, organized according to their spatio-temporal relationships. As a conceptual representation of audio, the audio script provides an interactive and interpretable rationale for human engagement. The audio script is then fed into a script compiler, which converts it into a computer program. Each line of the program calls a task-specific audio generation model or a computational operation function (e.g., concatenate, mix). The program is then executed to obtain an explainable solution for audio generation. We demonstrate the practicality of WavJourney across diverse real-world scenarios, including science fiction, education, and radio plays. The explainable and interactive design of WavJourney fosters human-machine co-creation in multi-round dialogues, enhancing creative control and adaptability in audio production. By rendering human imagination as audio, WavJourney opens up new avenues for creativity in multimedia content creation.
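The script-to-program pipeline described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the WavJourney implementation: the script fields (`type`, `layout`, `seconds`), the `compile_script` and `run` helpers, and the stub "models", which return constant-valued sample lists instead of calling real speech, music, or sound-effect generators. The script is also hard-coded rather than produced by an LLM.

```python
from typing import Callable, Dict, List, Tuple

Audio = List[float]   # mono samples; a stand-in for a real waveform
SAMPLE_RATE = 10      # toy rate so the example stays small

# Hypothetical audio script. In the described system an LLM would write this.
SCRIPT = [
    {"type": "music", "layout": "background", "desc": "calm synth pad", "seconds": 3},
    {"type": "speech", "layout": "foreground", "text": "Welcome aboard.", "seconds": 2},
    {"type": "sound_effect", "layout": "foreground", "desc": "airlock hiss", "seconds": 1},
]

def stub_model(level: float) -> Callable[[dict], Audio]:
    """Return a fake generator that emits a constant signal of the requested length."""
    def generate(item: dict) -> Audio:
        return [level] * (item["seconds"] * SAMPLE_RATE)
    return generate

# One stub per task-specific model (placeholders, not real models).
MODELS: Dict[str, Callable[[dict], Audio]] = {
    "speech": stub_model(0.5),
    "music": stub_model(0.1),
    "sound_effect": stub_model(0.3),
}

def concatenate(clips: List[Audio]) -> Audio:
    """Play clips one after another."""
    out: Audio = []
    for clip in clips:
        out.extend(clip)
    return out

def mix(a: Audio, b: Audio) -> Audio:
    """Overlay two clips sample by sample, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]

Line = Tuple[str, str, object]  # (target variable, operation, arguments)

def compile_script(script: List[dict]) -> List[Line]:
    """Compile the script into a flat program, one operation per line."""
    program: List[Line] = []
    foreground: List[str] = []
    background: List[str] = []
    for i, item in enumerate(script):
        name = f"clip{i}"
        program.append((name, "generate", item))
        (foreground if item["layout"] == "foreground" else background).append(name)
    program.append(("fg", "concatenate", foreground))
    program.append(("bg", "concatenate", background))
    program.append(("out", "mix", ["fg", "bg"]))
    return program

def run(program: List[Line]) -> Audio:
    """Execute the program line by line and return the final mix."""
    env: Dict[str, Audio] = {}
    for target, op, args in program:
        if op == "generate":
            env[target] = MODELS[args["type"]](args)
        elif op == "concatenate":
            env[target] = concatenate([env[name] for name in args])
        elif op == "mix":
            env[target] = mix(env[args[0]], env[args[1]])
        else:
            raise ValueError(f"unknown operation: {op}")
    return env["out"]

if __name__ == "__main__":
    for line in compile_script(SCRIPT):
        print(line[0], "=", line[1])
    print(len(run(compile_script(SCRIPT))), "samples")
```

Because the compiled program is a plain list of named steps, each line can be inspected or edited before execution, which is one way to picture the interpretability and human-in-the-loop editing that the abstract emphasizes.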
