Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

November 25, 2025
Authors: Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag
cs.AI

Abstract

Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce ∞-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish ∞-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that ∞-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
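The abstract gives no implementation details, so the following is only a toy sketch of how the temporal index bookkeeping behind the three components might look. It tracks integer temporal RoPE indices and contains no tensors or attention. Everything named here is hypothetical: the `Rollout` class, the `f_max` and `block` parameters, the "virtual time" clock, pinning the sink frame at index 0, evicting frames that fall outside the horizon, and modelling a cut as a plain integer gap. These are assumptions made for illustration, not the authors' method.

```python
from typing import List


def block_relative_positions(times: List[int], f_max: int) -> List[int]:
    """Map virtual times of cached frames to temporal RoPE indices.

    The newest frame is anchored at the last index of the assumed horizon
    (f_max - 1); older frames are shifted backward by their time distance.
    Clamping at 0 is an assumption made here so the sink stays addressable.
    """
    newest = times[-1]
    return [max(0, (f_max - 1) - (newest - t)) for t in times]


class Rollout:
    """Hypothetical index-only model of an autoregressive rollout."""

    def __init__(self, f_max: int, block: int) -> None:
        self.f_max = f_max          # assumed base-model horizon, in latent frames
        self.block = block          # assumed latent frames generated per step
        self.times: List[int] = []  # virtual time of each cached frame, oldest first
        self.clock = 0              # next virtual time to hand out

    def step(self) -> List[int]:
        """Generate one block and return RoPE indices for the whole cache."""
        self.times += list(range(self.clock, self.clock + self.block))
        self.clock += self.block
        newest = self.times[-1]
        sink, rest = self.times[:1], self.times[1:]
        # Assumption: non-sink frames that would land at index <= 0 are evicted,
        # which reserves index 0 for the sink once the horizon is exceeded.
        rest = [t for t in rest if newest - t < self.f_max - 1]
        self.times = sink + rest
        return block_relative_positions(self.times, self.f_max)

    def kv_flush(self) -> None:
        """Keep only the sink (first frame) and the last generated frame."""
        if len(self.times) > 2:
            self.times = [self.times[0], self.times[-1]]

    def rope_cut(self, gap: int) -> None:
        """Skip `gap` virtual time steps so the next block is temporally
        separated from everything already cached (assumed model of a cut)."""
        self.clock += gap


if __name__ == "__main__":
    r = Rollout(f_max=8, block=2)
    print(r.step())  # [6, 7]
    print(r.step())  # [4, 5, 6, 7]
```

In this sketch the indices never leave the range 0 to `f_max - 1`, however many blocks are generated, which is the sense in which the rollout is unbounded. Standard RoPE attention scores depend only on differences between position indices, so shifting every cached index backward by the same amount leaves the relative temporal geometry among those frames unchanged.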