DexHoldem: Playing Texas Hold'em with a Dexterous Embodied System

Abstract

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, π_{0.5} obtains the highest task completion rate (61.2%), while π_{0.5} and π_0 tie on scene-preserving success rate (47.5%). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy (34.3%), while GPT 5.5 obtains the best average field-wise accuracy (66.8%), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.
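
The gap between strict problem-level accuracy and average field-wise accuracy can be illustrated with a small sketch. The abstract does not define either metric precisely, so the definitions below are assumptions: field-wise accuracy is taken as the per-field fraction of correct predictions averaged over fields, and strict accuracy as the fraction of problems in which every field is correct. The field names (`hole_cards`, `pot`, `to_act`) and the toy data are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List

# A recovered game state, modeled as a flat mapping from field name to value.
# The field names used below are hypothetical placeholders.
State = Dict[str, str]


def field_wise_accuracy(preds: List[State], golds: List[State]) -> float:
    """Assumed definition: per-field accuracy, averaged over all fields."""
    fields = sorted(golds[0].keys())
    per_field = []
    for field in fields:
        correct = sum(p.get(field) == g[field] for p, g in zip(preds, golds))
        per_field.append(correct / len(golds))
    return sum(per_field) / len(per_field)


def strict_accuracy(preds: List[State], golds: List[State]) -> float:
    """Assumed definition: a problem counts only if every field is correct."""
    exact = sum(
        all(p.get(field) == g[field] for field in g)
        for p, g in zip(preds, golds)
    )
    return exact / len(golds)


# Toy ground truth and predictions (fabricated purely for illustration).
golds = [
    {"hole_cards": "As Kd", "pot": "120", "to_act": "robot"},
    {"hole_cards": "7h 7c", "pot": "40", "to_act": "opponent"},
    {"hole_cards": "Qs Jd", "pot": "300", "to_act": "robot"},
]
preds = [
    {"hole_cards": "As Kd", "pot": "120", "to_act": "robot"},     # all correct
    {"hole_cards": "7h 7c", "pot": "60", "to_act": "opponent"},   # pot wrong
    {"hole_cards": "Qs Jd", "pot": "300", "to_act": "opponent"},  # to_act wrong
]

fw = field_wise_accuracy(preds, golds)
strict = strict_accuracy(preds, golds)
print(f"field-wise: {fw:.3f}, strict: {strict:.3f}")
```

In this toy case seven of the nine individual fields are correct, yet only one of the three problems is fully recovered. Under these assumed definitions, a single wrong field is enough to fail a problem, which is one way a model can score well field by field while still recovering few complete states.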