In-Context Prompt Editing For Conditional Audio Generation

Abstract

Distributional shift is a central challenge in deploying machine learning models, which can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation, where the encoded representations are easily undermined by unseen prompts, degrading the generated audio. The limited set of text-audio pairs remains inadequate for conditional audio generation in the wild because user prompts are under-specified. In particular, we observe consistent audio quality degradation in samples generated from user prompts, as opposed to training set prompts. To this end, we present a retrieval-based in-context prompt editing framework that leverages the training captions as demonstrative exemplars to revisit the user prompts. We show that the framework enhances audio quality across the set of collected user prompts, which were edited with reference to the training captions as exemplars.
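The retrieval-and-edit idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the token-overlap (Jaccard) retriever, the function names `retrieve_exemplars` and `build_edit_request`, the instruction template, and the example captions are all hypothetical, and the abstract does not specify which retriever or language model the framework actually uses.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_exemplars(user_prompt, training_captions, k=2):
    """Return the k training captions most similar to the user prompt.

    Assumption: similarity is plain token overlap, a stand-in for
    whatever retriever the actual framework uses.
    """
    query = tokenize(user_prompt)
    ranked = sorted(
        training_captions,
        key=lambda caption: jaccard(query, tokenize(caption)),
        reverse=True,
    )
    return ranked[:k]


def build_edit_request(user_prompt, exemplars):
    """Assemble an in-context editing request (hypothetical template)."""
    lines = ["Rewrite the user prompt in the style of these training captions:"]
    lines += [f"- {caption}" for caption in exemplars]
    lines.append(f"User prompt: {user_prompt}")
    lines.append("Edited prompt:")
    return "\n".join(lines)


# Made-up captions standing in for a training set.
captions = [
    "A dog barks repeatedly while birds chirp in the background",
    "Rain falls steadily on a metal roof with distant thunder",
    "A man speaks over the hum of a running engine",
]
exemplars = retrieve_exemplars("dog barking", captions, k=1)
request = build_edit_request("dog barking", exemplars)
```

The resulting `request` string would then be passed to a language model, and the edited prompt it returns would replace the under-specified user prompt as input to the text-to-audio model.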
