
Yume-1.5: A Text-Controlled Interactive World Generation Model

December 26, 2025
Authors: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang
cs.AI

Abstract

Recent work has demonstrated the promise of diffusion models for generating interactive and explorable worlds. However, most existing methods face critical challenges: excessively large parameter counts, reliance on lengthy inference procedures, and rapidly growing historical context, all of which severely limit real-time performance; moreover, they lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework that generates realistic, interactive, and continuous worlds from a single image or text prompt, and that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework that integrates unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text-embedding scheme; and (3) a text-controlled method for generating world events. The codebase is provided in the supplementary material.
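The abstract attributes the framework's long-video scalability to linear attention, which replaces the quadratic softmax attention with a kernel feature map so that keys and values can be summarized in a fixed-size state. The sketch below is a minimal, generic illustration of that idea (using the common ELU+1 feature map), not the paper's actual implementation; all function names and shapes are illustrative assumptions.

```python
import numpy as np

def feature_map(x):
    # ELU(x) + 1: keeps features strictly positive, a common choice
    # in linear-attention formulations (illustrative, not from the paper).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention in O(N * d^2) instead of O(N^2 * d).

    Attention weights phi(q)·phi(k) / Z sum to 1 over keys, so each
    output row is a convex combination of the rows of V.
    """
    Qf, Kf = feature_map(Q), feature_map(K)  # (N, d) feature-mapped queries/keys
    KV = Kf.T @ V                            # (d, d_v) fixed-size key/value summary
    Z = Qf @ Kf.sum(axis=0)                  # (N,) per-query normalizer
    return (Qf @ KV) / Z[:, None]

N, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, N, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Because the key/value summary `KV` has a fixed size independent of sequence length, a streaming generator can update it incrementally per frame rather than re-attending over the full history, which is the property that makes this family of methods attractive for long-horizon world generation.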