APEX: A Large-Scale Multi-Task Aesthetic-Aware Popularity Prediction System for AI-Generated Music

Abstract

Music popularity prediction has attracted growing research interest because of its relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, in which a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. A key factor that remains unexplored in this setting is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music. Trained on over 211k songs (10k hours of audio) from Suno and Udio, APEX jointly predicts engagement-based popularity signals (streams and likes scores) and five perceptual aesthetic quality dimensions from frozen audio embeddings extracted with MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, which comprises pairwise human preference battles across eleven generative music systems unseen during training, adding aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.
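
To make the multi-task setup concrete, the following is a minimal sketch of how a single frozen audio embedding could feed a shared trunk with separate heads for the streams score, the likes score, and the five aesthetic dimensions. The abstract does not describe the head architecture, layer sizes, or loss weighting, so everything here is an assumption made for illustration: the names `MultiTaskHead` and `multitask_loss`, the dimensions, the ReLU trunk, and the unweighted sum of mean squared errors are all hypothetical, and the random vector merely stands in for a real MERT embedding.

```python
import random

EMBED_DIM = 8      # hypothetical; real MERT embeddings are much larger
HIDDEN_DIM = 4     # hypothetical width of the shared trunk
N_AESTHETIC = 5    # five perceptual aesthetic dimensions (from the abstract)


def linear(weights, bias, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    """Return small random weights and zero biases for one linear layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiTaskHead:
    """Shared trunk plus one linear head per task (assumed design)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.trunk = init_layer(HIDDEN_DIM, EMBED_DIM, rng)
        self.heads = {
            "streams": init_layer(1, HIDDEN_DIM, rng),
            "likes": init_layer(1, HIDDEN_DIM, rng),
            "aesthetics": init_layer(N_AESTHETIC, HIDDEN_DIM, rng),
        }

    def forward(self, embedding):
        # The embedding is treated as fixed input; only the trunk and
        # heads would be trained in this sketch.
        hidden = [max(0.0, h) for h in linear(*self.trunk, embedding)]  # ReLU
        return {name: linear(w, b, hidden)
                for name, (w, b) in self.heads.items()}


def multitask_loss(pred, target):
    """Unweighted sum of per-task mean squared errors (an assumption)."""
    total = 0.0
    for name, values in pred.items():
        errs = [(p - t) ** 2 for p, t in zip(values, target[name])]
        total += sum(errs) / len(errs)
    return total


rng = random.Random(1)
# Stand-in for a frozen MERT embedding of one song.
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
model = MultiTaskHead()
outputs = model.forward(embedding)
# Made-up target values, for illustration only.
targets = {"streams": [0.3], "likes": [0.1], "aesthetics": [0.5] * N_AESTHETIC}
loss = multitask_loss(outputs, targets)
```

Sharing one trunk across all heads is what lets the popularity and aesthetic objectives shape a common representation. A real system would more likely use a deep learning framework and might weight the per-task losses differently.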