Step-Audio-EditX Technical Report
November 5, 2025
Authors: Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
cs.AI
Abstract
We present Step-Audio-EditX, the first open-source LLM-based audio model
excelling at expressive and iterative audio editing encompassing emotion,
speaking style, and paralinguistics alongside robust zero-shot text-to-speech
(TTS) capabilities. Our core innovation lies in leveraging only large-margin
synthetic data, which circumvents the need for embedding-based priors or
auxiliary modules. This large-margin learning approach enables both iterative
control and high expressivity across voices, and represents a fundamental pivot
from the conventional focus on representation-level disentanglement. Evaluation
results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and
Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.