Step-Audio 2 Technical Report

Abstract

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and can call external tools, such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

