Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
November 25, 2025
Authors: Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang
cs.AI
Abstract
World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods: video tokens are generated block by block, with diffusion applied within each block while conditioning on previous blocks, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV cache management, enabling efficient, variable-length, and high-quality generation.
Inferix is therefore designed as a next-generation inference engine that enables immersive world synthesis through optimized semi-autoregressive decoding. This dedicated focus on world simulation sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from engines for classic video diffusion models (such as xDiT). Inferix also provides interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.
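The semi-autoregressive decoding loop described in the abstract can be illustrated with a toy sketch. This is not Inferix's API: every name here (`attend`, `denoise_step`, `generate`, `window_blocks`) is hypothetical, scalar floats stand in for latent video tokens, the "denoiser" is a simple blend rather than a learned model, and the sliding-window cache eviction is an assumption made only to keep memory bounded. What the sketch does show is the control flow: each block starts from noise, is refined over several diffusion steps while attending to cached keys and values from earlier blocks, and is appended to the KV cache only once it is fully denoised.

```python
import math
import random
from collections import deque


def attend(query, cache):
    """Softmax attention of one scalar query over cached (key, value) pairs."""
    if not cache:
        return 0.0  # first block: there is no earlier context to condition on
    scores = [query * key for key, _ in cache]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    return sum(w * value for w, (_, value) in zip(weights, cache)) / total


def denoise_step(block, cache):
    """Toy denoiser (assumption): move each token halfway toward its context readout."""
    return [0.5 * x + 0.5 * attend(x, cache) for x in block]


def generate(num_blocks, block_size=4, num_steps=3, window_blocks=2, seed=0):
    """Generate num_blocks blocks; the cache keeps only the last window_blocks blocks."""
    rng = random.Random(seed)
    cache = deque(maxlen=window_blocks * block_size)  # bounded KV cache
    video = []
    for _ in range(num_blocks):
        # Diffusion inside the block: start from Gaussian noise and refine.
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        for _ in range(num_steps):
            block = denoise_step(block, cache)
        # Autoregression across blocks: cache the finished block so that
        # later blocks condition on it. Key and value are the same scalar here.
        cache.extend((x, x) for x in block)
        video.append(block)
    return video, cache


if __name__ == "__main__":
    video, cache = generate(num_blocks=5)
    print(len(video), "blocks generated;", len(cache), "tokens kept in the cache")
```

Because the cache is written once per finished block rather than once per denoising step, the number of generated blocks can vary at run time without recomputing earlier context, which is the property the abstract attributes to LLM-style KV cache management.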