MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

Abstract

Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers speech-to-speech performance comparable to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
