Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Abstract

Instruction-following audio-language models have recently received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works support only a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference, because the textual labels associated with different datasets vary considerably in task focus, language, annotation granularity, and text structure. To overcome this one-to-many interference, we carefully design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags, encouraging knowledge sharing through shared tags and avoiding interference through specified tags. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which accepts diverse audio and text inputs, enables multi-turn dialogues, and supports various audio-centric scenarios.
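
The idea of conditioning the decoder on hierarchical tags can be illustrated with a minimal sketch. It only shows how a tag prefix might be assembled so that related tasks share some tags and differ in others. Every tag string, the `TaskSpec` fields, and the `build_tag_prefix` helper are hypothetical names chosen for illustration; the abstract does not specify the actual tag vocabulary or ordering.

```python
from dataclasses import dataclass

# All tag strings below are hypothetical placeholders, not the model's real vocabulary.
TASK_TAGS = {"transcribe": "<|transcribe|>", "caption": "<|caption|>"}
LANGUAGE_TAGS = {"en": "<|en|>", "zh": "<|zh|>", "unknown": "<|unknown|>"}


@dataclass(frozen=True)
class TaskSpec:
    task: str             # what the decoder should do
    audio_language: str   # language spoken in the audio, or "unknown"
    text_language: str    # language of the target text
    timestamps: bool      # whether the target text carries timestamps


def build_tag_prefix(spec: TaskSpec) -> list[str]:
    """Return the tag sequence that conditions the decoder, coarse to fine."""
    return [
        "<|start|>",
        LANGUAGE_TAGS[spec.audio_language],
        TASK_TAGS[spec.task],
        LANGUAGE_TAGS[spec.text_language],
        "<|timestamps|>" if spec.timestamps else "<|notimestamps|>",
    ]


english_asr = build_tag_prefix(TaskSpec("transcribe", "en", "en", False))
chinese_asr = build_tag_prefix(TaskSpec("transcribe", "zh", "zh", False))
sound_caption = build_tag_prefix(TaskSpec("caption", "unknown", "en", False))

# Tags common to both transcription tasks: the channel for knowledge sharing.
shared = set(english_asr) & set(chinese_asr)
# Tags that differ between them: these keep differently styled labels apart.
specified = set(english_asr) ^ set(chinese_asr)
```

In this sketch the two transcription tasks share the task tag while their language tags differ, which mirrors the shared versus specified distinction described in the abstract. A real system would map these tags to decoder token IDs and prepend them to the target text.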
