UnityShots: Boundary-Aware Gating for Memory-Driven Multi-Shot Audio-Video Generation

Abstract

Generating a coherent multi-shot video requires structured cross-shot memory: subject appearance, scene context, and speaker identity must all persist across cuts. Existing approaches fall into three groups: those that train end-to-end over fixed-length sequences and therefore do not scale, those that generate shot by shot with memory banks that grow linearly, and those that orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3 and trained on annotated cinematic and music-video shots. The video stream maintains two fixed-size slots: a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the tail of the immediately preceding shot. Both slots are updated at every cut by a boundary-conditioned gate that fuses a visual cut probability with beat-tracker signals. The audio stream injects a reference speaker token at every shot, preserving vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through AdaLN, serves as an inference-time control knob over transition strength. We also release a benchmark of 200 multi-cultural multi-shot sequences spanning six ethnic regions and at least ten languages, with per-shot reference identities, reference audio, and per-boundary transition labels. Evaluated across the I2V, T2V, and R2V conditioning modes, UnityShots outperforms open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.
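
To make the two-slot memory concrete, the following is a minimal sketch of how a boundary-conditioned gate might update fixed-size LTM and STM slots at a cut. It is an illustration under stated assumptions, not the UnityShots implementation: the abstract does not specify the gate's functional form, so the sigmoid fusion, the weights `w_cut`, `w_beat`, and `bias`, the `ltm_rate` damping factor, the linear blending rule, and the names `ShotMemory` and `boundary_gate` are all hypothetical. Memory slots are modeled as plain float vectors rather than latent tensors.

```python
import math
from dataclasses import dataclass
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def boundary_gate(cut_prob: float, beat_strength: float,
                  w_cut: float = 4.0, w_beat: float = 2.0,
                  bias: float = -3.0) -> float:
    """Fuse a visual cut probability and a beat-tracker signal into a gate in (0, 1).

    The weighted-sum-plus-sigmoid form and the default weights are assumptions
    made for illustration only.
    """
    return sigmoid(w_cut * cut_prob + w_beat * beat_strength + bias)


def blend(old: Vector, new: Vector, g: float) -> Vector:
    """Convex combination: g = 0 keeps the old slot, g = 1 overwrites it."""
    return [(1.0 - g) * o + g * n for o, n in zip(old, new)]


@dataclass
class ShotMemory:
    ltm: Vector             # long-term slot, initialised from the opening shot
    stm: Vector             # short-term slot, tracks the preceding shot's tail
    ltm_rate: float = 0.1   # hypothetical damping that keeps the LTM anchored

    def update(self, tail: Vector, cut_prob: float, beat_strength: float) -> float:
        """Update both slots at a cut; slot sizes never change."""
        g = boundary_gate(cut_prob, beat_strength)
        self.stm = blend(self.stm, tail, g)
        self.ltm = blend(self.ltm, tail, self.ltm_rate * g)
        return g


# Usage: the opening shot seeds both slots, then a confident on-beat cut arrives.
memory = ShotMemory(ltm=[1.0, 0.0], stm=[1.0, 0.0])
gate = memory.update(tail=[0.0, 1.0], cut_prob=1.0, beat_strength=1.0)
```

In this sketch a confident, on-beat cut drives the gate toward 1, so the STM slot is almost fully replaced by the new tail while the LTM slot drifts only slightly from the opening shot. Because both slots have constant size, memory cost does not grow with the number of shots, in contrast to the linearly growing memory banks mentioned above.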