Yume-1.5: A Text-Controlled Interactive World Generation Model
December 26, 2025
Authors: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang
cs.AI
Abstract
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges: excessively large parameter counts, reliance on lengthy inference schedules, and rapidly growing historical context, all of which severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework that generates realistic, interactive, and continuous worlds from a single image or text prompt and supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework that integrates unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text-embedding scheme; and (3) a text-control method for generating world events. The codebase is provided in the supplementary material.
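The abstract names linear attention as the mechanism that keeps the historical context from growing without bound. As a rough, self-contained illustration of that property (not the paper's implementation), the sketch below shows causal, streaming linear attention with a constant-size recurrent state: instead of caching all past keys and values, each new token updates a fixed d-by-d summary. The feature map phi (elu + 1, a common choice in the linear-attention literature), the class name StreamingLinearAttention, and all dimensions are illustrative assumptions.

```python
import torch

def phi(x: torch.Tensor) -> torch.Tensor:
    # Positive feature map (elu + 1), a common choice for linear attention.
    return torch.nn.functional.elu(x) + 1.0

class StreamingLinearAttention:
    """Causal linear attention with a constant-size recurrent state.

    Rather than storing a KV cache that grows with the generated history,
    we maintain S = sum_t phi(k_t) v_t^T and z = sum_t phi(k_t), so each
    new token is processed in O(d^2) time and memory, independent of the
    number of frames generated so far.
    """

    def __init__(self, dim: int):
        self.S = torch.zeros(dim, dim)  # running sum of phi(k) v^T outer products
        self.z = torch.zeros(dim)       # running sum of phi(k) for normalization

    def step(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        qf, kf = phi(q), phi(k)
        self.S += torch.outer(kf, v)          # fold the new token into the state
        self.z += kf
        num = qf @ self.S                     # phi(q)^T S
        den = (qf @ self.z).clamp_min(1e-6)   # phi(q)^T z, guarded against zero
        return num / den

# Example: stream a few tokens of dimension 64 through the module.
attn = StreamingLinearAttention(dim=64)
for _ in range(3):
    q, k, v = torch.randn(3, 64).unbind(0)
    out = attn.step(q, k, v)  # shape (64,), constant memory per step
```

The point of the sketch is the per-step cost: it stays flat no matter how long the generated video becomes, which is exactly the bottleneck the abstract attributes to a rapidly growing historical context. How Yume-1.5 combines this with its unified context compression is not specified in the abstract.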