

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

March 11, 2026
Authors: Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan
cs.AI

Abstract

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowdsourced subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
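
The abstract does not specify exactly how intra-modal similarity is turned into an event curve, so the following is a minimal, hypothetical sketch of the idea: measure frame-to-frame change in pretrained-encoder features within one modality, then rescale so curves from video and music are comparable. The function name `event_curve`, the feature shapes, and the min-max normalization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def event_curve(embeddings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical intra-modal event curve.

    `embeddings` is a (T, D) array of per-frame (or per-chunk) features from a
    pretrained encoder of a single modality (video or music). The curve scores
    how much the representation changes between consecutive time steps, i.e.
    "when and how much" change occurs, independent of what the content is.
    """
    # L2-normalize so the change measure is a cosine distance.
    normed = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    # Cosine similarity between consecutive frames, mapped to a change score.
    sim = np.sum(normed[1:] * normed[:-1], axis=1)
    change = 1.0 - sim
    # Rescale to [0, 1] so curves from different modalities are comparable.
    span = change.max() - change.min()
    return (change - change.min()) / (span + eps)

# Usage sketch: condition the fine-tuned text-to-music model on the music-event
# curve during training, then swap in the video-event curve at inference.
video_feats = np.random.randn(240, 512)   # placeholder for video-encoder outputs
music_feats = np.random.randn(240, 512)   # placeholder for music-encoder outputs
v_curve = event_curve(video_feats)
m_curve = event_curve(music_feats)
```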