NarraScore:通过分层情感控制连接视觉叙事与音乐动态
NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
February 9, 2026
作者: Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu
cs.AI
摘要
为长视频合成连贯配乐仍是一项艰巨挑战,目前受限于三大关键障碍:计算可扩展性、时序连贯性,以及最关键的——对叙事逻辑动态演变的普遍语义盲区。为突破这些限制,我们提出NarraScore框架,其核心思想在于将情感视为叙事逻辑的高密度压缩表达。我们创新性地利用冻结式视觉语言模型作为连续情感感知器,将高维视觉流蒸馏为稠密的叙事感知效价-唤醒轨迹。在机制设计上,NarraScore采用双分支注入策略协调全局结构与局部动态:全局语义锚点确保风格稳定性,而精准的令牌级情感适配器通过直接元素残差注入调控局部张力。这种极简设计绕过了稠密注意力与架构复制的瓶颈,有效缓解了数据稀缺导致的过拟合风险。实验表明,NarraScore以可忽略的计算开销实现了最先进的连贯性与叙事对齐度,为长视频配乐生成建立了全自动范式。
English
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.