

SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

October 26, 2025
Authors: Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer, June Young Yi, Changho Choi, Roger Wattenhofer
cs.AI

Abstract

Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.
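The training data described above takes the form of audio editing triplets: an input clip, a free-form natural language instruction, and the resulting edited clip. A minimal sketch of such a triplet record (the field names and dummy waveforms here are hypothetical illustrations, not the paper's actual data format):

```python
from dataclasses import dataclass

@dataclass
class EditTriplet:
    """One training example: (input audio, edit instruction, output audio)."""
    input_audio: list[float]   # waveform samples of the original clip
    instruction: str           # free-form natural language edit instruction
    output_audio: list[float]  # waveform samples after the edit is applied

# Example triplet with placeholder waveforms for illustration
triplet = EditTriplet(
    input_audio=[0.0, 0.1, -0.1],
    instruction="Add light rain in the background",
    output_audio=[0.0, 0.12, -0.08],
)
print(triplet.instruction)
```

In practice the paper generates such triplets with Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline rather than by hand-writing records like this.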
December 1, 2025