

Step-Audio-EditX Technical Report

November 5, 2025
Authors: Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu Zhang, Fei Tian, Xuerui Yang, Daxin Jiang, Gang Yu
cs.AI

Abstract

We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing, encompassing emotion, speaking style, and paralinguistics, alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
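The abstract's central idea, training only on pairs whose target attribute differs by a large margin, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the paper does not specify a scoring function, threshold, or data format, so `score` stands in for any attribute scorer (e.g. an emotion-intensity classifier) and `margin` is an arbitrary cutoff.

```python
# Hypothetical sketch of large-margin pair selection; the scorer and
# threshold are illustrative assumptions, not details from the paper.

def select_large_margin_pairs(candidates, score, margin=0.5):
    """Keep (preferred, rejected) pairs whose attribute-score gap
    exceeds `margin`, so the training signal is unambiguous."""
    pairs = []
    for better, worse in candidates:
        if score(better) - score(worse) >= margin:
            pairs.append((better, worse))
    return pairs

# Toy usage with a stand-in scorer over synthetic utterances.
cands = [("happy_strong.wav", "happy_weak.wav"),
         ("sad_mid.wav", "sad_mid2.wav")]
scores = {"happy_strong.wav": 0.9, "happy_weak.wav": 0.2,
          "sad_mid.wav": 0.55, "sad_mid2.wav": 0.5}
kept = select_large_margin_pairs(cands, scores.get)
print(kept)  # [('happy_strong.wav', 'happy_weak.wav')]
```

The intuition this sketch captures: by discarding near-tie pairs, the model only ever sees examples where the desired attribute change is unambiguous, which is what the authors credit for enabling iterative control without embedding-based priors.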