

In-Context Prompt Editing For Conditional Audio Generation

November 1, 2023
Authors: Ernie Chang, Pin-Jie Lin, Yang Li, Sidd Srinivasan, Gael Le Lan, David Kant, Yangyang Shi, Forrest Iandola, Vikas Chandra
cs.AI

Abstract

Distributional shift is a central challenge in deploying machine learning models, as they can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation, where the encoded representations are easily undermined by unseen prompts, degrading the generated audio: the limited set of text-audio pairs remains inadequate for conditional audio generation in the wild because user prompts are under-specified. In particular, we observe consistent audio quality degradation in samples generated from user prompts, as opposed to training set prompts. To this end, we present a retrieval-based in-context prompt editing framework that leverages training captions as demonstrative exemplars to revise user prompts. We show that the framework enhances audio quality across the set of collected user prompts, which were edited with the training captions as exemplars.
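The core idea of the abstract can be sketched as two steps: retrieve the training captions most similar to a user prompt, then use them as in-context exemplars when asking an editor model to rewrite the prompt. The snippet below is a minimal illustration, not the paper's implementation: it uses a simple bag-of-words cosine similarity for retrieval, and all function names and the prompt template are hypothetical.

```python
# Hedged sketch of retrieval-based in-context prompt editing.
# Assumptions: bag-of-words cosine retrieval stands in for whatever
# retriever the paper uses; the prompt template is illustrative.
import math
from collections import Counter


def _bow(text):
    """Bag-of-words vector as a token->count mapping."""
    return Counter(text.lower().split())


def _cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def retrieve_exemplars(user_prompt, training_captions, k=2):
    """Return the k training captions most similar to the user prompt."""
    q = _bow(user_prompt)
    ranked = sorted(training_captions,
                    key=lambda c: _cosine(q, _bow(c)),
                    reverse=True)
    return ranked[:k]


def build_editing_prompt(user_prompt, exemplars):
    """Assemble an in-context prompt asking an editor model to rewrite
    the user prompt in the style of the retrieved training captions."""
    lines = ["Rewrite the prompt in the style of these captions:"]
    lines += [f"- {c}" for c in exemplars]
    lines.append(f"Prompt: {user_prompt}")
    return "\n".join(lines)
```

In practice the assembled prompt would be sent to a language model, and the edited prompt passed to the text-to-audio generator in place of the raw user input.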