EgoCS-400K: A First-Person Gameplay Dataset for World Models

Abstract

The shift from video generation to interactive world modeling places new demands on data. Beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. Such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale, replay-grounded egocentric Counter-Strike dataset for world models. It is built from public professional CS and CS2 match demos, which preserve human gameplay trajectories and can be parsed, replayed, rendered, and temporally aligned. From these demos we extract player states, view directions, movements, keyboard/button inputs, view-angle changes, weapon usage, game events, and round-level context, and we render clean first-person videos from the same trajectories. EgoCS-400K contains over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps, with 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and egocentric agent action understanding. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS-400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real-world embodied data.
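
The abstract does not specify a storage format or schema, so the following is only a minimal sketch of what one time-aligned trajectory step could look like and how the headline statistics relate to each other. All names here (`Step`, its fields, `view_angle_delta`, `videos_from_rounds`, `mean_clip_seconds`) are hypothetical and chosen for illustration; they are not the dataset's actual interface. The reported counts are lower bounds ("over", "more than"), so the derived average clip length is a rough estimate rather than a published figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One time-aligned step of a first-person trajectory (hypothetical schema)."""
    tick: int                             # replay tick this step was parsed from
    frame_index: int                      # index of the matching rendered video frame
    position: tuple[float, float, float]  # player position (x, y, z)
    view_angles: tuple[float, float]      # (pitch, yaw) in degrees
    buttons: frozenset[str]               # pressed inputs, e.g. {"forward", "attack"}
    weapon: str                           # currently held weapon
    events: tuple[str, ...] = ()          # game events that occurred at this tick


def view_angle_delta(prev: tuple[float, float],
                     curr: tuple[float, float]) -> tuple[float, float]:
    """View-angle change between two steps, with yaw wrapped into [-180, 180)."""
    d_pitch = curr[0] - prev[0]
    d_yaw = (curr[1] - prev[1] + 180.0) % 360.0 - 180.0
    return d_pitch, d_yaw


def videos_from_rounds(rounds: int, viewpoints_per_round: int) -> int:
    """Number of first-person videos if every viewpoint of every round is rendered."""
    return rounds * viewpoints_per_round


def mean_clip_seconds(total_hours: float, num_videos: int) -> float:
    """Average video length in seconds implied by total duration and video count."""
    return total_hours * 3600.0 / num_videos


# Rough consistency check using the lower-bound figures quoted in the abstract.
num_videos = videos_from_rounds(40_000, 10)
avg_seconds = mean_clip_seconds(10_000, num_videos)
```

Under these assumptions, 40,000 rounds with 10 viewpoints each account for the 400,000 videos, and 10,000 hours spread over them implies clips of roughly a minute and a half on average. The yaw wrap-around in `view_angle_delta` matters for action-conditioned models because a turn across the ±180° boundary would otherwise appear as a near-360° jump.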