A^2RD: Agentic Autoregressive Diffusion for Long Video Consistency
May 7, 2026
Authors: Do Xuan Long, Yale Song, Min-Yen Kan, Tomas Pfister, Long T. Le
cs.AI
Abstract
Synthesizing consistent and coherent long videos remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A^2RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A^2RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve-Synthesize-Refine-Update cycle. It comprises three core components: (i) Multimodal Video Memory, which tracks video progression across modalities; (ii) Adaptive Segment Generation, which switches among generation modes to achieve natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement, which self-corrects each segment at the frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions designed to stress-test long-horizon consistency. Across public benchmarks and LVBench-C, spanning one- to ten-minute videos, A^2RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
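To make the closed-loop structure concrete, the following is a minimal sketch of the Retrieve-Synthesize-Refine-Update cycle described in the abstract, assuming a segment-by-segment interface. All class and function names (VideoMemory, synthesize_segment, refine_segment, generate_long_video) are hypothetical placeholders for illustration, not the authors' released code; stubs stand in for the memory, the diffusion generator, and the self-improvement stage.

```python
# Hypothetical sketch of the Retrieve-Synthesize-Refine-Update loop.
# Names and stubs are illustrative assumptions, not the A^2RD implementation.
from dataclasses import dataclass, field


@dataclass
class VideoMemory:
    """Stand-in for the Multimodal Video Memory: stores per-segment summaries."""
    entries: list = field(default_factory=list)

    def retrieve(self, script_step: str) -> list:
        # Return context relevant to the next script step (here: all entries).
        return self.entries

    def update(self, summary: str) -> None:
        # Record a summary of the newly generated segment.
        self.entries.append(summary)


def synthesize_segment(script_step: str, context: list) -> str:
    # Placeholder for Adaptive Segment Generation (would invoke a diffusion model
    # conditioned on the retrieved memory context).
    return f"segment for '{script_step}' conditioned on {len(context)} memory entries"


def refine_segment(segment: str) -> str:
    # Placeholder for Hierarchical Test-Time Self-Improvement
    # (frame-level and video-level self-correction).
    return segment + " [refined]"


def generate_long_video(script: list) -> list:
    """Closed-loop, segment-by-segment generation over a script."""
    memory = VideoMemory()
    segments = []
    for step in script:
        context = memory.retrieve(step)               # Retrieve
        segment = synthesize_segment(step, context)   # Synthesize
        segment = refine_segment(segment)             # Refine
        memory.update(f"summary of: {segment}")       # Update
        segments.append(segment)
    return segments


if __name__ == "__main__":
    for seg in generate_long_video(["intro shot", "chase scene", "resolution"]):
        print(seg)
```

The sketch only illustrates the control flow: each segment is generated with retrieved memory as context, self-corrected before being committed, and then summarized back into memory so later segments stay consistent with earlier ones.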