Aligning Generative Music AI with Human Preferences: Methods and Challenges

Abstract

Recent advances in generative AI for music have achieved remarkable fidelity and stylistic diversity, yet these systems often fail to align with nuanced human preferences because the loss functions they are trained on do not directly capture those preferences. This paper advocates the systematic application of preference alignment techniques to music generation, addressing the fundamental gap between computational optimization and human musical appreciation. Drawing on recent breakthroughs, including MusicRL's large-scale preference learning, multi-preference alignment frameworks such as the diffusion-based preference optimization in DiffRhythm+, and inference-time optimization techniques such as Text2midi-InferAlign, we discuss how these techniques can address music's unique challenges: temporal coherence, harmonic consistency, and subjective quality assessment. We identify key research challenges, including scalability to long-form compositions and the reliability of preference modeling, among others. Looking forward, we envision preference-aligned music generation enabling transformative applications in interactive composition tools and personalized music services. This work calls for sustained interdisciplinary research that combines advances in machine learning and music theory to create music AI systems that truly serve human creative and experiential needs.