SAO-Instruct: Free-form Audio Editing using Natural Language Instructions
October 26, 2025
Authors: Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer, June Young Yi, Changho Choi, Roger Wattenhofer
cs.AI
Abstract
Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.
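The abstract describes the training data as (input audio, edit instruction, output audio) triplets. As a purely illustrative sketch, the hypothetical Python record below shows one way such a triplet could be represented; the class name, field names, array shapes, and default sample rate are assumptions for illustration and are not taken from the released code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class AudioEditTriplet:
    """One training example: (input audio, edit instruction, output audio).

    Hypothetical schema for illustration only; the released dataset and
    code may organize these fields differently.
    """
    input_audio: np.ndarray   # waveform before editing, shape (channels, samples)
    instruction: str          # free-form natural language edit instruction
    output_audio: np.ndarray  # waveform after the edit has been applied
    sample_rate: int = 44100  # assumed stereo 44.1 kHz, matching Stable Audio Open


# Example triplet with placeholder (silent) one-second stereo audio:
example = AudioEditTriplet(
    input_audio=np.zeros((2, 44100)),
    instruction="Add light rain in the background",
    output_audio=np.zeros((2, 44100)),
)
```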