Soap2Soap: Long-Form Cinematic Video Remaking via Multi-Agent Collaboration

Abstract

We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.
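
The closed-loop verification step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ShotAudit` and `remake_with_verification`, the three scores in [0, 1], the 0.8 threshold, and the round limit are all assumptions made for illustration. The sketch only shows the control flow of auditing every shot and regenerating the ones that fail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ShotAudit:
    """Hypothetical audit scores in [0, 1] for one generated shot."""
    identity: float
    stability: float
    alignment: float

    def passes(self, threshold: float) -> bool:
        # A shot passes only if its weakest score clears the threshold.
        return min(self.identity, self.stability, self.alignment) >= threshold


def remake_with_verification(
    shot_ids: List[str],
    generate: Callable[[str, int], str],
    audit: Callable[[str, str], ShotAudit],
    threshold: float = 0.8,   # assumed value, not from the paper
    max_rounds: int = 3,      # assumed value, not from the paper
) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Generate every shot once, then regenerate only the shots that fail audit."""
    outputs = {sid: generate(sid, 0) for sid in shot_ids}
    attempts = {sid: 1 for sid in shot_ids}

    for round_idx in range(1, max_rounds + 1):
        failed = [
            sid for sid in shot_ids
            if not audit(sid, outputs[sid]).passes(threshold)
        ]
        if not failed:
            break
        # Selective regeneration: passing shots are left untouched.
        for sid in failed:
            outputs[sid] = generate(sid, round_idx)
            attempts[sid] += 1

    # Shots regenerated in the final round are returned without a further audit.
    return outputs, attempts
```

Here `generate` and `audit` stand in for the video synthesis and verification agents. In a real pipeline they would also receive the screenplay entry and the reference anchors for each shot.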