EgoCS-400K: An Egocentric Gameplay Dataset for World Models

Abstract

The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale, replay-grounded egocentric Counter-Strike dataset for world models. It is built from public professional CS and CS2 match demos, which preserve human gameplay trajectories and enable parsing, replaying, rendering, and temporal alignment. We extract player states, view directions, movements, keyboard/button inputs, view-angle changes, weapon usage, game events, and round-level context, and render clean first-person videos from the same trajectories. EgoCS-400K contains over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps with 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and egocentric action understanding for agents. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS-400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real-world embodied data.
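
To make "temporal alignment" concrete, the following is a minimal sketch of how parsed replay samples might be paired with rendered video frames. The abstract does not specify the dataset's file format, so the `TickRecord` schema, its field names, and the helpers `align_frames_to_ticks` and `view_angle_deltas` are all hypothetical, and the tick rate and frame rate in the usage note are illustrative assumptions rather than properties of EgoCS-400K.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class TickRecord:
    """One parsed replay sample for a single player (hypothetical schema)."""
    time_s: float       # seconds since round start
    yaw_deg: float      # horizontal view direction, in degrees
    pitch_deg: float    # vertical view direction, in degrees
    buttons: frozenset  # pressed inputs, e.g. {"forward", "fire"}


def align_frames_to_ticks(ticks, num_frames, fps):
    """For each rendered frame, pick the latest tick at or before the frame time.

    Assumes `ticks` is sorted by time_s and that frame i is shown at i / fps.
    """
    times = [t.time_s for t in ticks]
    aligned = []
    for i in range(num_frames):
        frame_time = i / fps
        # bisect_right returns the count of ticks with time_s <= frame_time.
        idx = max(bisect_right(times, frame_time) - 1, 0)
        aligned.append(ticks[idx])
    return aligned


def view_angle_deltas(aligned):
    """Per-frame yaw change, wrapped to [-180, 180) so 359 -> 1 counts as +2."""
    deltas = []
    for prev, cur in zip(aligned, aligned[1:]):
        deltas.append((cur.yaw_deg - prev.yaw_deg + 180.0) % 360.0 - 180.0)
    return deltas
```

With, say, samples recorded at an assumed 64 ticks per second and video rendered at an assumed 32 frames per second, each frame would be paired with every second tick, and the wrapped yaw deltas give a camera-motion signal that stays continuous when the view direction crosses 0 degrees.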