AudioStory: Generating Long-Form Narrative Audio with Large Language Models

Abstract

Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory has strong instruction-following and reasoning-based generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and a consistent emotional tone. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark, AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show that AudioStory performs strongly on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory.
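
The decomposition of a narrative query into temporally ordered sub-tasks can be pictured as a plan of consecutive audio events. The following is a minimal Python sketch of such a plan, not AudioStory's actual implementation: the names `AudioEvent` and `build_timeline`, the per-event durations, and the way context is accumulated from earlier captions are all assumptions made for illustration. In the framework itself an LLM would produce the sub-tasks; here they are hard-coded.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AudioEvent:
    """One sub-task in a narrative plan (hypothetical structure)."""
    index: int
    caption: str
    start_s: float
    end_s: float
    context: str  # captions of all preceding events, used as a contextual cue


def build_timeline(steps: List[Tuple[str, float]]) -> List[AudioEvent]:
    """Turn ordered (caption, duration in seconds) pairs into a contiguous timeline."""
    events: List[AudioEvent] = []
    cursor = 0.0
    for i, (caption, duration) in enumerate(steps):
        if duration <= 0:
            raise ValueError(f"duration must be positive, got {duration}")
        # Assumption: the contextual cue is simply the chain of earlier captions.
        context = " -> ".join(e.caption for e in events)
        events.append(AudioEvent(i, caption, cursor, cursor + duration, context))
        cursor += duration
    return events


# Hard-coded stand-in for what an LLM planner might return.
plan = build_timeline([
    ("rain falls on a tin roof", 4.0),
    ("a door creaks open", 2.0),
    ("footsteps cross a wooden floor", 3.5),
])
for event in plan:
    print(f"[{event.start_s:.1f}s-{event.end_s:.1f}s] {event.caption}")
```

Each event carries both its own caption and a summary of what came before, loosely mirroring the split between intra-event alignment and cross-event coherence described in the abstract; how AudioStory actually encodes these signals (the bridging and residual queries) is not captured by this sketch.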