Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Abstract

Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness when maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce ∞-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, in which each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish ∞-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that ∞-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
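
The three components can be pictured as bookkeeping on integer temporal indices. The following is a minimal sketch under stated assumptions, not the authors' implementation: it manipulates only the frame indices that would be fed to the temporal RoPE, not the rotations themselves. The function names, the `gap` parameter, and the choice to pin the new block to the last indices below the horizon are all hypothetical, chosen to illustrate the ideas described in the abstract.

```python
from typing import List, Tuple


def block_relative_positions(
    num_cached: int, block_size: int, max_horizon: int, gap: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (cached_positions, new_block_positions) as temporal indices.

    Assumption: the new block always occupies the last `block_size`
    indices below `max_horizon`, and cached frames are re-indexed
    backward from there, so relative distances stay the same however
    many frames have been generated so far.
    Assumption: `gap` > 0 inserts an extra jump between the cache and
    the new block, standing in for a RoPE-Cut-style discontinuity.
    """
    new_start = max_horizon - block_size
    first_cached = new_start - gap - num_cached
    if block_size <= 0 or first_cached < 0:
        raise ValueError("cache, gap and block do not fit inside the horizon")
    cached = [first_cached + i for i in range(num_cached)]
    new_block = [new_start + i for i in range(block_size)]
    return cached, new_block


def kv_flush(cache: List[str]) -> List[str]:
    """Keep only the first entry (the sink) and the most recent entry.

    Assumption: the cache is an ordered list with the sink frame first.
    """
    if len(cache) <= 2:
        return list(cache)
    return [cache[0], cache[-1]]
```

With `num_cached=6`, `block_size=3`, and `max_horizon=21`, the cached frames receive indices 12 through 17 and the new block receives 18 through 20, regardless of how long the rollout has been running. Passing `gap=5` shifts the cached frames back to 7 through 12 while the new block stays at 18 through 20, which is the kind of controlled jump the abstract attributes to RoPE Cut. Calling `kv_flush(["sink", "f1", "f2", "f3"])` leaves `["sink", "f3"]`.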