A^2RD: Agentic Autoregressive Diffusion for Long Video Consistency

Abstract

Synthesizing long videos that are both consistent and coherent remains a fundamental challenge: existing methods suffer from semantic drift and narrative collapse over long horizons. We present A^2RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A^2RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves a video segment by segment through a Retrieve-Synthesize-Refine-Update cycle. It comprises three core components: (i) a Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation, which switches among generation modes to balance natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement, which refines each segment at the frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions designed to stress-test long-horizon consistency. Across public benchmarks and LVBench-C, spanning one- to ten-minute videos, A^2RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains and also highlight notable improvements in motion and transition smoothness.
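The closed-loop Retrieve-Synthesize-Refine-Update cycle described above can be illustrated with a minimal sketch. This is not the authors' implementation: `VideoMemory`, `generate_long_video`, the callback signatures, the `threshold` and `max_refine_steps` parameters, and the use of plain strings as "segments" are all hypothetical choices made for illustration. The sketch also collapses the frame-level and video-level refinement into a single loop and omits the switching among generation modes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VideoMemory:
    """Hypothetical stand-in for the Multimodal Video Memory."""
    entries: List[str] = field(default_factory=list)

    def retrieve(self, k: int = 2) -> List[str]:
        # Return the k most recently stored segments as context.
        return self.entries[-k:]

    def update(self, segment: str) -> None:
        # Record the accepted segment so later segments can condition on it.
        self.entries.append(segment)


def generate_long_video(
    prompts: List[str],
    synthesize: Callable[[str, List[str]], str],
    score: Callable[[str, List[str]], float],
    refine: Callable[[str, List[str]], str],
    threshold: float = 0.8,      # assumed acceptance threshold
    max_refine_steps: int = 3,   # assumed refinement budget
) -> List[str]:
    memory = VideoMemory()
    segments: List[str] = []
    for prompt in prompts:
        context = memory.retrieve()              # Retrieve
        segment = synthesize(prompt, context)    # Synthesize
        for _ in range(max_refine_steps):        # Refine
            if score(segment, context) >= threshold:
                break
            segment = refine(segment, context)
        memory.update(segment)                   # Update
        segments.append(segment)
    return segments


# Toy callbacks: a "draft" segment is rejected once, then accepted as "final".
def toy_synthesize(prompt: str, context: List[str]) -> str:
    return f"draft:{prompt}"


def toy_score(segment: str, context: List[str]) -> float:
    return 1.0 if segment.startswith("final:") else 0.0


def toy_refine(segment: str, context: List[str]) -> str:
    return segment.replace("draft:", "final:", 1)


video = generate_long_video(["intro", "chase"], toy_synthesize, toy_score, toy_refine)
print(video)  # ['final:intro', 'final:chase']
```

Because each segment is refined before it is written to memory, a flawed draft does not become context for later segments, which is the intuition behind preventing error propagation in a segment-by-segment loop.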
