A^2RD: Agentic Autoregressive Diffusion for Long Video Consistency
May 7, 2026
Authors: Do Xuan Long, Yale Song, Min-Yen Kan, Tomas Pfister, Long T. Le
cs.AI
Abstract
Synthesizing consistent and coherent long videos remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A^2RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A^2RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve-Synthesize-Refine-Update cycle. It comprises three core components: (i) Multimodal Video Memory, which tracks video progression across modalities; (ii) Adaptive Segment Generation, which switches among generation modes to achieve natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement, which self-corrects each segment at the frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions designed to stress-test long-horizon consistency. Across public benchmarks and LVBench-C, spanning one- to ten-minute videos, A^2RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
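To make the closed-loop structure concrete, the following is a minimal sketch of the Retrieve-Synthesize-Refine-Update cycle described in the abstract, assuming a segment-by-segment interface. All class and function names (VideoMemory, synthesize_segment, refine_segment, generate_long_video) are hypothetical placeholders for illustration, not the authors' released code; stubs stand in for the memory, the diffusion generator, and the self-improvement stage.

```python
# Hypothetical sketch of the Retrieve-Synthesize-Refine-Update loop.
# Names and stubs are illustrative assumptions, not the A^2RD implementation.
from dataclasses import dataclass, field


@dataclass
class VideoMemory:
    """Stand-in for the Multimodal Video Memory: stores per-segment summaries."""
    entries: list = field(default_factory=list)

    def retrieve(self, script_step: str) -> list:
        # Return context relevant to the next script step (here: all entries).
        return self.entries

    def update(self, summary: str) -> None:
        # Record a summary of the newly generated segment.
        self.entries.append(summary)


def synthesize_segment(script_step: str, context: list) -> str:
    # Placeholder for Adaptive Segment Generation (would invoke a diffusion model
    # conditioned on the retrieved memory context).
    return f"segment for '{script_step}' conditioned on {len(context)} memory entries"


def refine_segment(segment: str) -> str:
    # Placeholder for Hierarchical Test-Time Self-Improvement
    # (frame-level and video-level self-correction).
    return segment + " [refined]"


def generate_long_video(script: list) -> list:
    """Closed-loop, segment-by-segment generation over a script."""
    memory = VideoMemory()
    segments = []
    for step in script:
        context = memory.retrieve(step)               # Retrieve
        segment = synthesize_segment(step, context)   # Synthesize
        segment = refine_segment(segment)             # Refine
        memory.update(f"summary of: {segment}")       # Update
        segments.append(segment)
    return segments


if __name__ == "__main__":
    for seg in generate_long_video(["intro shot", "chase scene", "resolution"]):
        print(seg)
```

The sketch only illustrates the control flow: each segment is generated with retrieved memory as context, self-corrected before being committed, and then summarized back into memory so later segments stay consistent with earlier ones.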