Step-Audio-EditX Technical Report
November 5, 2025
Authors: Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
cs.AI
Abstract
We present Step-Audio-EditX, the first open-source LLM-based audio model
excelling at expressive and iterative audio editing encompassing emotion,
speaking style, and paralinguistics alongside robust zero-shot text-to-speech
(TTS) capabilities. Our core innovation lies in leveraging only large-margin
synthetic data, which circumvents the need for embedding-based priors or
auxiliary modules. This large-margin learning approach enables both iterative
control and high expressivity across voices, and represents a fundamental pivot
from the conventional focus on representation-level disentanglement. Evaluation
results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and
Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.