Step-Audio-EditX Technical Report

Abstract

We present Step-Audio-EditX, the first open-source LLM-based audio model that excels at expressive, iterative audio editing (covering emotion, speaking style, and paralinguistics) while also offering robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
