Step-Audio-EditX Technical Report
November 5, 2025
Authors: Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu, Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
cs.AI
Abstract
We present Step-Audio-EditX, the first open-source LLM-based audio model
excelling at expressive and iterative audio editing encompassing emotion,
speaking style, and paralinguistics alongside robust zero-shot text-to-speech
(TTS) capabilities. Our core innovation lies in leveraging only large-margin
synthetic data, which circumvents the need for embedding-based priors or
auxiliary modules. This large-margin learning approach enables both iterative
control and high expressivity across voices, and represents a fundamental pivot
from the conventional focus on representation-level disentanglement. Evaluation
results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and
Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.