DexHoldem: Playing Texas Hold'em with a Dexterous Embodied System

Abstract

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, π_{0.5} obtains the highest task completion rate (61.2%), while π_{0.5} and π_0 tie on scene-preserving success rate (47.5%). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy (34.3%), while GPT 5.5 obtains the best average field-wise accuracy (66.8%), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.
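
The gap between strict problem-level accuracy and average field-wise accuracy can be illustrated with a small sketch. The abstract does not give the benchmark's exact metric definitions, so the code assumes the natural reading: a problem counts as strictly correct only if every field of the predicted game state matches the reference, while field-wise accuracy is computed per field and then averaged over fields. The field names (`pot`, `community_cards`, `to_act`) and the toy data are hypothetical and are not taken from DexHoldem.

```python
from typing import Any, Dict, List

State = Dict[str, Any]  # hypothetical structured game state: field name -> value


def strict_accuracy(preds: List[State], golds: List[State]) -> float:
    """Fraction of problems where every reference field is predicted exactly."""
    hits = sum(
        all(pred.get(field) == value for field, value in gold.items())
        for pred, gold in zip(preds, golds)
    )
    return hits / len(golds)


def field_wise_accuracy(preds: List[State], golds: List[State]) -> float:
    """Per-field accuracy, averaged over fields (assumed definition)."""
    fields = sorted({field for gold in golds for field in gold})
    per_field = []
    for field in fields:
        # Only score problems whose reference state contains this field.
        pairs = [(p, g) for p, g in zip(preds, golds) if field in g]
        correct = sum(p.get(field) == g[field] for p, g in pairs)
        per_field.append(correct / len(pairs))
    return sum(per_field) / len(per_field)


# Toy data (illustrative only): the second prediction misreads one card.
golds = [
    {"pot": 30, "community_cards": ("Ah", "Kd", "7c"), "to_act": "robot"},
    {"pot": 50, "community_cards": ("Ah", "Kd", "7c", "2s"), "to_act": "opponent"},
]
preds = [
    {"pot": 30, "community_cards": ("Ah", "Kd", "7c"), "to_act": "robot"},
    {"pot": 50, "community_cards": ("Ah", "Kd", "7c", "2h"), "to_act": "opponent"},
]

strict = strict_accuracy(preds, golds)         # 1 of 2 problems fully correct
fieldwise = field_wise_accuracy(preds, golds)  # mean of 1.0, 0.5, 1.0
```

In this toy case a single misread card halves the strict score while the field-wise average stays above 0.8. Under these assumed definitions, that is one way a model could score well on individual fields yet poorly on complete state recovery.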