UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating
June 19, 2026
Authors: Jiehui Huang, Yuechen Zhang, Bin Xia, Jiahao Wang, Xu He, Zhenchao Tang, Meng Chu, Xin Tao, Pengfei Wan, Jiaya Jia
cs.AI
Abstract
Generating a coherent multi-shot video requires structured cross-shot memory: subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3 and trained on annotated cinematic and music-video shots. The video stream maintains two fixed-size slots: a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the tail of the immediately preceding shot. Both slots are updated at every cut by a boundary-conditioned gate that fuses visual cut probability with beat-tracker signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through adaptive layer normalization (AdaLN), serves as an inference-time control knob over transition strength. We release a benchmark of 200 multi-cultural multi-shot sequences spanning six ethnic regions and ten or more languages, with per-shot reference identities, reference audio, and per-boundary transition labels. Evaluated across image-to-video (I2V), text-to-video (T2V), and reference-to-video (R2V) conditioning modes, UnityShots leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.
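To make the two-slot memory idea concrete, the following is a minimal sketch of a fixed-size LTM/STM update driven by a boundary gate. It is an illustration only, not the UnityShots implementation: the abstract does not specify the gate's functional form, so the sigmoid fusion, the weights `w_cut`, `w_beat`, `bias`, the `ltm_rate` damping factor, and all names (`boundary_gate`, `ShotMemory`, `blend`) are assumptions introduced here. Memory slots are modeled as plain float vectors rather than latent token tensors.

```python
import math
from dataclasses import dataclass
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def boundary_gate(cut_prob: float, beat_strength: float,
                  w_cut: float = 4.0, w_beat: float = 2.0,
                  bias: float = -3.0) -> float:
    """Hypothetical gate: fuse a visual cut probability and a beat-tracker
    signal into one value in (0, 1). Weights are illustrative, not learned."""
    return sigmoid(w_cut * cut_prob + w_beat * beat_strength + bias)


def blend(old: Vector, new: Vector, g: float) -> Vector:
    """Convex combination: g = 0 keeps the old slot, g = 1 overwrites it."""
    return [(1.0 - g) * o + g * n for o, n in zip(old, new)]


@dataclass
class ShotMemory:
    ltm: Vector  # long-term slot, initialized from the opening shot
    stm: Vector  # short-term slot, tracks the tail of the preceding shot

    def update(self, shot_tail: Vector, cut_prob: float,
               beat_strength: float, ltm_rate: float = 0.1) -> float:
        """Update both slots at a cut. Slot sizes never change, so memory
        cost stays constant no matter how many shots are generated."""
        g = boundary_gate(cut_prob, beat_strength)
        # Assumption: a stronger boundary lets the STM take more of the new tail.
        self.stm = blend(self.stm, shot_tail, g)
        # Assumption: the LTM drifts only slightly, staying near the opening shot.
        self.ltm = blend(self.ltm, shot_tail, ltm_rate * g)
        return g


# Usage: two cuts, one strong (hard cut on a beat) and one weak.
mem = ShotMemory(ltm=[1.0, 0.0], stm=[1.0, 0.0])
for tail, p_cut, beat in [([0.0, 1.0], 0.9, 0.8), ([0.5, 0.5], 0.2, 0.1)]:
    gate = mem.update(tail, p_cut, beat)
    print(f"gate={gate:.3f} ltm={mem.ltm} stm={mem.stm}")
```

The point of the sketch is the contrast with a linearly growing memory bank: each cut rewrites two fixed-size slots in place instead of appending a new entry per shot.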