Enhance audio generation controllability through representation similarity regularization

Abstract

This paper presents an approach to improving control over audio generation by emphasizing the alignment between audio and text representations during model training. In language model-based audio generation, the model uses both text and audio token representations to predict subsequent audio tokens. However, the current configuration lacks explicit regularization to ensure that the chosen text representation is aligned with the language model's predictions. We propose incorporating audio and text representation regularization, particularly during the classifier-free guidance (CFG) phase, where the text condition is excluded from cross attention during language model training. The proposed regularization aims to minimize discrepancies in audio and text similarity relative to other samples within the same training batch. Experimental results on music and audio generation tasks show that the proposed methods improve objective metrics for both audio and music generation, as well as human perception for audio generation.
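The abstract does not state the exact loss, so the following is only an illustrative sketch of what a batch-level representation similarity regularizer could look like. It assumes each sample has already been reduced to one pooled text vector and one pooled audio vector, and it penalizes the gap between text-to-text and audio-to-audio cosine similarities across all distinct pairs in a batch. The function names, the pooling assumption, and the choice of cosine similarity with an absolute-difference penalty are all hypothetical and not taken from the paper. The sketch also does not model the CFG step in which the text condition is dropped from cross attention.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the B x B matrix of pairwise cosine similarities (hypothetical helper)."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    size = len(vectors)
    sim = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            # A small epsilon guards against division by zero for all-zero vectors.
            sim[i][j] = dot / (norms[i] * norms[j] + 1e-8)
    return sim


def similarity_regularization(text_reps, audio_reps):
    """Mean absolute gap between text and audio similarities over distinct batch pairs.

    Assumption: text_reps[i] and audio_reps[i] are pooled representations of the
    same training sample. This is an illustrative loss, not the paper's definition.
    """
    if len(text_reps) != len(audio_reps):
        raise ValueError("text and audio batches must have the same size")
    text_sim = cosine_similarity_matrix(text_reps)
    audio_sim = cosine_similarity_matrix(audio_reps)
    size = len(text_reps)
    # Skip the diagonal: a sample's similarity to itself carries no information.
    pairs = [(i, j) for i in range(size) for j in range(size) if i != j]
    if not pairs:
        return 0.0
    return sum(abs(text_sim[i][j] - audio_sim[i][j]) for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    text_batch = [[1.0, 0.0], [0.0, 1.0]]
    audio_batch = [[1.0, 0.0], [1.0, 0.0]]
    print(similarity_regularization(text_batch, audio_batch))
```

In this toy batch the two text vectors are orthogonal while the two audio vectors are identical, so the penalty is large. A batch whose audio representations mirror the similarity structure of its text representations would score near zero. In training, a term like this would presumably be added to the usual token-prediction loss with some weighting, which is again an assumption.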
