Soap2Soap: Long-Form Cinematic Video Remaking via Multi-Agent Collaboration

Abstract

We study series-level cinematic remaking, a long-horizon video-to-video generation problem in which full episodes or films are localized through stylization or actor replacement while narrative structure, motion choreography, and character identity are strictly preserved across hundreds of shots. Existing video generation and editing pipelines often break down in this regime: identity drift compounds from shot to shot, backgrounds mutate, and semantics erode under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism with two components: a scene-aware JSON screenplay that serves as a persistent semantic backbone, and visual reference anchors that are dynamically allocated at both the scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, which jointly generates multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent then audits identity, stability, and alignment, and triggers selective regeneration of the shots that fail. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.
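
The abstract names three mechanisms without showing their shape: a JSON screenplay, a keyframe grid, and a closed verification loop. The following is a minimal Python sketch of how such pieces might fit together, not the authors' implementation. Every name in it (`Shot`, `Scene`, `grid_cells`, `regenerate_failed`, the 0.8 threshold, and the `fake_generate` / `fake_audit` stand-ins for real generator and verifier models) is a hypothetical chosen for illustration, and the actual screenplay schema and grid formulation are not specified in this section.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Shot:
    # Hypothetical fields; the paper's real screenplay schema is not given here.
    shot_id: str
    description: str
    characters: List[str]


@dataclass
class Scene:
    scene_id: str
    location: str
    shots: List[Shot] = field(default_factory=list)


def grid_cells(n: int, cols: int, cell_w: int, cell_h: int) -> List[Tuple[int, int, int, int]]:
    """Row-major (left, top, right, bottom) boxes for tiling n keyframes on one canvas."""
    return [
        ((i % cols) * cell_w, (i // cols) * cell_h,
         (i % cols + 1) * cell_w, (i // cols + 1) * cell_h)
        for i in range(n)
    ]


def regenerate_failed(
    scene: Scene,
    generate: Callable[[Shot, int], str],
    audit: Callable[[Shot, str], Dict[str, float]],
    threshold: float = 0.8,   # assumed pass mark, not from the paper
    max_rounds: int = 3,      # initial pass plus at most two regeneration rounds
) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Generate every shot once, then regenerate only shots whose worst audit score fails."""
    results = {s.shot_id: generate(s, 0) for s in scene.shots}
    attempts = {s.shot_id: 1 for s in scene.shots}
    for round_idx in range(1, max_rounds):
        failed = [
            s for s in scene.shots
            if min(audit(s, results[s.shot_id]).values()) < threshold
        ]
        if not failed:
            break
        for s in failed:
            results[s.shot_id] = generate(s, round_idx)
            attempts[s.shot_id] += 1
    return results, attempts


def fake_generate(shot: Shot, attempt: int) -> str:
    """Stand-in for a video generator: returns a label instead of a clip."""
    return f"{shot.shot_id}-v{attempt}"


def fake_audit(shot: Shot, clip: str) -> Dict[str, float]:
    """Stand-in verifier: the first attempt at shot s2 fails the identity check."""
    identity = 0.5 if clip == "s2-v0" else 0.95
    return {"identity": identity, "stability": 0.9, "alignment": 0.9}


scene = Scene("sc1", "kitchen", [
    Shot("s1", "wide shot, two characters argue", ["A", "B"]),
    Shot("s2", "close-up on A", ["A"]),
])
screenplay_json = json.dumps(asdict(scene), indent=2)  # persistent semantic record
clips, attempts = regenerate_failed(scene, fake_generate, fake_audit)
```

In this sketch only shot `s2` is regenerated, which is the point of a selective loop: shots that already pass the audit are left untouched, so a later fix cannot introduce new drift into them. The grid helper only computes tile coordinates; jointly generating keyframes in a shared latent context would require an actual image model and is outside what this example shows.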