A^2RD: Agentic Autoregressive Diffusion for Long Video Consistency
May 7, 2026
Authors: Do Xuan Long, Yale Song, Min-Yen Kan, Tomas Pfister, Long T. Le
cs.AI
Abstract
Synthesizing consistent and coherent long videos remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A^2RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A^2RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve--Synthesize--Refine--Update cycle. It comprises three core components: (i) Multimodal Video Memory, which tracks video progression across modalities; (ii) Adaptive Segment Generation, which switches among generation modes to balance natural progression with visual consistency; and (iii) Hierarchical Test-Time Self-Improvement, which refines each segment at the frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public benchmarks and LVBench-C, spanning one- to ten-minute videos, A^2RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
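The closed-loop Retrieve--Synthesize--Refine--Update cycle described above can be sketched at a high level as follows. This is a minimal illustrative sketch, not the paper's implementation: all names (`VideoMemory`, `synthesize`, `refine`, the mode-switching heuristic) are hypothetical stand-ins, since the abstract does not specify the actual interfaces.

```python
# Hypothetical sketch of the Retrieve-Synthesize-Refine-Update loop.
# All classes and functions below are illustrative assumptions, not
# the paper's actual API.

class VideoMemory:
    """Stand-in for the Multimodal Video Memory: tracks generated
    segments and a coarse narrative/entity state across modalities."""

    def __init__(self):
        self.segments = []   # segments generated so far
        self.state = {}      # e.g. entity and environment descriptors

    def retrieve(self):
        # Return the context relevant to generating the next segment.
        return {"history": list(self.segments), "state": dict(self.state)}

    def update(self, segment, state_delta):
        self.segments.append(segment)
        self.state.update(state_delta)


def synthesize(context, mode):
    # Placeholder for the diffusion-based segment generator.
    return f"segment_{len(context['history'])}[{mode}]"


def refine(segment, context):
    # Placeholder for hierarchical test-time self-improvement,
    # which the abstract describes as operating at both the frame
    # and video levels.
    return segment + "+refined"


def generate_long_video(num_segments):
    memory = VideoMemory()
    for _ in range(num_segments):
        context = memory.retrieve()                   # Retrieve
        # Adaptive Segment Generation: switch modes depending on
        # whether the tracked state calls for a scene transition.
        mode = ("transition"
                if context["state"].get("scene_change")
                else "continue")
        segment = synthesize(context, mode)           # Synthesize
        segment = refine(segment, context)            # Refine
        memory.update(segment, state_delta={})        # Update
    return memory.segments
```

The key design point the abstract emphasizes is that refinement happens inside the loop, before the memory update, so each committed segment has already been self-corrected and errors do not propagate into later segments.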