Yume-1.5: A Text-Controlled Interactive World Generation Model

Abstract

Recent approaches have demonstrated the promise of diffusion models for generating interactive, explorable worlds. However, most of these methods face critical challenges, including excessively large parameter counts, reliance on lengthy multi-step inference, and rapidly growing historical context, which severely limit real-time performance. They also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 supports keyboard-based exploration of the generated worlds and comprises three core components: (1) a long-video generation framework that integrates unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; and (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.

