DexHoldem: Playing Texas Hold'em with a Dexterous Embodied System

Abstract

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, π_{0.5} obtains the highest task completion rate (61.2%), while π_{0.5} and π_0 tie on scene-preserving success rate (47.5%). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy (34.3%), while GPT 5.5 obtains the best average field-wise accuracy (66.8%), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.
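The gap between strict problem-level accuracy and average field-wise accuracy can be illustrated with a minimal sketch. The abstract does not define either metric precisely, so the definitions below are assumptions: strict accuracy counts a problem as correct only if every field of the game state matches, and field-wise accuracy averages per-field accuracy across fields. The field names (`board`, `pot`, `to_act`) and the example values are hypothetical and not taken from the benchmark.

```python
from typing import Dict, List

# A recovered game state, modeled here as a flat mapping of field name to value.
# This schema is an illustrative assumption, not the benchmark's actual format.
State = Dict[str, str]


def strict_accuracy(preds: List[State], golds: List[State]) -> float:
    """Fraction of problems where every gold field is predicted exactly."""
    hits = sum(
        all(pred.get(field) == value for field, value in gold.items())
        for pred, gold in zip(preds, golds)
    )
    return hits / len(golds)


def fieldwise_accuracy(preds: List[State], golds: List[State]) -> float:
    """Mean of per-field accuracies, each computed over problems with that field."""
    fields = sorted({field for gold in golds for field in gold})
    per_field = []
    for field in fields:
        pairs = [(p.get(field), g[field]) for p, g in zip(preds, golds) if field in g]
        per_field.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_field) / len(per_field)


# Hypothetical ground truth and predictions for two problems.
golds = [
    {"board": "Ah Kd 7c", "pot": "40", "to_act": "robot"},
    {"board": "2s 2h 9d", "pot": "120", "to_act": "opponent"},
]
preds = [
    {"board": "Ah Kd 7c", "pot": "40", "to_act": "opponent"},   # one field wrong
    {"board": "2s 2h 9d", "pot": "100", "to_act": "opponent"},  # one field wrong
]

strict = strict_accuracy(preds, golds)
fieldwise = fieldwise_accuracy(preds, golds)
print(f"strict={strict:.3f} fieldwise={fieldwise:.3f}")
```

In this toy case each prediction gets most fields right, so field-wise accuracy is fairly high, yet no problem is fully correct, so strict accuracy is zero. Under these assumed definitions, a single wrong field is enough to make a state unusable for routing, which is one way the two reported numbers can diverge.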