A^2RD: Agentic Autoregressive Diffusion for Long Video Consistency
May 7, 2026
Authors: Do Xuan Long, Yale Song, Min-Yen Kan, Tomas Pfister, Long T. Le
cs.AI
Abstract
Synthesizing consistent and coherent long videos remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A^2RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A^2RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve--Synthesize--Refine--Update cycle. It comprises three core components: (i) Multimodal Video Memory, which tracks video progression across modalities; (ii) Adaptive Segment Generation, which switches among generation modes to balance natural progression with visual consistency; and (iii) Hierarchical Test-Time Self-Improvement, which refines each segment at the frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public benchmarks and LVBench-C, spanning one- to ten-minute videos, A^2RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
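The closed-loop Retrieve--Synthesize--Refine--Update cycle described above can be sketched at a high level as follows. This is a minimal illustrative sketch, not the paper's implementation: all names (`VideoMemory`, `synthesize`, `refine`, the mode-switching heuristic) are hypothetical stand-ins, since the abstract does not specify the actual interfaces.

```python
# Hypothetical sketch of the Retrieve-Synthesize-Refine-Update loop.
# All classes and functions below are illustrative assumptions, not
# the paper's actual API.

class VideoMemory:
    """Stand-in for the Multimodal Video Memory: tracks generated
    segments and a coarse narrative/entity state across modalities."""

    def __init__(self):
        self.segments = []   # segments generated so far
        self.state = {}      # e.g. entity and environment descriptors

    def retrieve(self):
        # Return the context relevant to generating the next segment.
        return {"history": list(self.segments), "state": dict(self.state)}

    def update(self, segment, state_delta):
        self.segments.append(segment)
        self.state.update(state_delta)


def synthesize(context, mode):
    # Placeholder for the diffusion-based segment generator.
    return f"segment_{len(context['history'])}[{mode}]"


def refine(segment, context):
    # Placeholder for hierarchical test-time self-improvement,
    # which the abstract describes as operating at both the frame
    # and video levels.
    return segment + "+refined"


def generate_long_video(num_segments):
    memory = VideoMemory()
    for _ in range(num_segments):
        context = memory.retrieve()                   # Retrieve
        # Adaptive Segment Generation: switch modes depending on
        # whether the tracked state calls for a scene transition.
        mode = ("transition"
                if context["state"].get("scene_change")
                else "continue")
        segment = synthesize(context, mode)           # Synthesize
        segment = refine(segment, context)            # Refine
        memory.update(segment, state_delta={})        # Update
    return memory.segments
```

The key design point the abstract emphasizes is that refinement happens inside the loop, before the memory update, so each committed segment has already been self-corrected and errors do not propagate into later segments.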