Soap2Soap: Remaking Long-Form Cinematic Videos via Multi-Agent Collaboration

Abstract

We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.
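As a rough illustration of how the two bridges and the verification loop might fit together, the sketch below models a scene-aware screenplay as nested records that serialize to JSON, resolves a visual reference anchor per shot (shot-level if present, otherwise scene-level), and regenerates only the shots that fail an audit. All names here (`Scene`, `Shot`, `resolve_anchor`, `remake`, the field names, the file paths, the `0.8` threshold, and the retry limit) are hypothetical and not taken from the paper; `generate` and `audit` stand in for the synthesis and verification agents, whose actual interfaces the abstract does not specify. Batch keyframe generation is not modeled.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Shot:
    shot_id: str
    action: str                    # what happens in the shot
    anchor: Optional[str] = None   # shot-level visual reference, if one is allocated


@dataclass
class Scene:
    scene_id: str
    location: str
    characters: List[str]
    anchor: str                    # scene-level visual reference
    shots: List[Shot] = field(default_factory=list)


def resolve_anchor(scene: Scene, shot: Shot) -> str:
    """Prefer the shot-level anchor; fall back to the scene-level one."""
    return shot.anchor if shot.anchor is not None else scene.anchor


def remake(
    scenes: List[Scene],
    generate: Callable[[Scene, Shot, str], Any],
    audit: Callable[[Any, str], Dict[str, float]],
    threshold: float = 0.8,        # assumed pass mark, not from the paper
    max_retries: int = 2,          # assumed retry budget, not from the paper
) -> Dict[str, Any]:
    """Generate every shot, regenerating only those that fail the audit."""
    clips: Dict[str, Any] = {}
    for scene in scenes:
        for shot in scene.shots:
            ref = resolve_anchor(scene, shot)
            clip = generate(scene, shot, ref)
            retries = 0
            # audit() is assumed to return scores in [0, 1] for
            # identity, stability, and alignment; the worst one decides.
            while min(audit(clip, ref).values()) < threshold and retries < max_retries:
                retries += 1
                clip = generate(scene, shot, ref)
            clips[shot.shot_id] = clip
    return clips


# A toy screenplay: one scene, two shots, one of which has its own anchor.
screenplay = [
    Scene(
        "s1", "kitchen", ["Ana"], "refs/s1.png",
        [
            Shot("s1_001", "Ana enters"),
            Shot("s1_002", "close-up on Ana", "refs/s1_002.png"),
        ],
    ),
]

# The same structure as a JSON document that can persist across agents.
screenplay_json = json.dumps([asdict(s) for s in screenplay], indent=2)
```

Keeping the screenplay as plain serializable records means every agent reads the same semantic state, and gating regeneration per shot avoids re-synthesizing shots that already pass the audit.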