Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

Abstract

We present Echo-Infinity, an autoregressive (AR) framework for real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress history of any length at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. Because these designs rely on a limited cache window and ignore the noise introduced by autoregressive generation, they inevitably lose historical information and amplify compounding errors. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism whenever past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce a Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow no further than the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In both long and short video generation, Echo-Infinity achieves state-of-the-art performance and, to our knowledge, demonstrates promising 24-hour (>1.3M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.
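To make the two mechanisms above concrete, the following is a minimal sketch in plain Python and not the implementation used in this work. It illustrates (a) a fixed-size memory updated by attention and a gate when frames are evicted, and (b) temporal RoPE ids that keep sink frames at id 0 onward while capping the newest frame id. All names (update_memory, relative_rope_ids, gate_weight, max_id) are hypothetical. The sketch assumes identity query/key/value projections, a single scalar gate per memory query, and a contiguous local window ending at the newest frame; none of these details are specified in the abstract.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_memory(memory, evicted, gate_weight):
    """One gated cross-attention update of the memory queries.

    memory:      list of K vectors (the memory queries), each of length dim
    evicted:     list of N token vectors leaving the local window
    gate_weight: hypothetical gate parameter vector of length dim
    Returns a new list of K vectors, so memory size never grows.
    """
    dim = len(memory[0])
    updated = []
    for query in memory:
        # Attention of this memory query over the evicted tokens.
        weights = softmax([dot(query, tok) / math.sqrt(dim) for tok in evicted])
        read = [sum(w * tok[i] for w, tok in zip(weights, evicted))
                for i in range(dim)]
        # Scalar sigmoid gate decides how much of the read-out is written.
        gate = 1.0 / (1.0 + math.exp(-dot(gate_weight, read)))
        updated.append([(1.0 - gate) * q + gate * r
                        for q, r in zip(query, read)])
    return updated


def relative_rope_ids(num_generated, num_sink, window, max_id):
    """Temporal RoPE ids for the sink frames and the local-window frames.

    Sink frames start at id 0. The newest frame id grows with the rollout
    but is capped at max_id (assumed pretrained maximum temporal id).
    """
    if num_sink + window > max_id + 1:
        raise ValueError("sink frames and window must fit within max_id + 1 ids")
    sink = list(range(min(num_sink, num_generated)))
    n_local = min(window, num_generated - len(sink))
    newest = min(num_generated - 1, max_id)
    local = list(range(newest - n_local + 1, newest + 1))
    return sink, local
```

Under these assumptions, calling update_memory once per eviction keeps the memory at K vectors no matter how many frames have been generated, and relative_rope_ids returns the same ids for every step once the rollout is longer than max_id + 1 frames, so the ids seen at inference stay inside the range seen in training.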