
RLDX-1 Technical Report

May 5, 2026
Authors: Dongyoung Kim, Huiwon Jang, Myungkyu Koo, Suhyeok Jang, Taeyoung Kim, Beomjun Kim, Byungjun Yoon, Changsung Jang, Daewon Choi, Dongsu Han, Donguk Lee, Heeseung Kwon, Hojin Jeon, Jaehyun Kang, Jaekyoung Bae, Jihyuk Lee, Jimin Lee, John Won, Joonwoo Ahn, Junhyeong Park, Junyoung Sung, Kyungmin Lee, Minseong Han, Minsung Yoon, Sejune Joo, Seonil Son, Seungcheol Park, Seunggeun Cho, Seungjun Moon, Seungku Kim, Yonghoon Dong, Yongjin Cho, Youngchan Kim, Chang Hwan Kim, Dohyeon Kim, Hazel Lee, Heecheol Kim, Hensen Ahn, Hyungkyu Ryu, Hyunsoo Choi, Hyunsoo Shin, Jaeheon Jung, Jaewoo Kim, Jinwook Kim, Joochul Chang, Joonsoo Kim, Junghun Park, Jungwoo Park, Junho Cho, Junhyeok Park, Junwon Lee, Kangwook Lee, Kwanghoon Kim, Kyoungwhan Choe, Manoj Bhadu, Nayoung Oh, Sangjun Kim, Sangwoo Kim, Seunghoon Shim, Seunghyun Kim, Seungjun Lee, Seungyup Ka, Sungryol Yang, Wook Jung, Yashu Shukla, Yeonjae Lee, Yeonwoo Bae, Jinwoo Shin
cs.AI

Abstract

While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e., broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g., motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesized training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g., π_{0.5} and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows clear superiority on ALLEX humanoid tasks, achieving an 86.8% success rate where π_{0.5} and GR00T N1.6 reach around 40%, highlighting its ability to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
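
The abstract describes MSAT only at a high level: per-modality processing streams whose tokens are then fused by joint self-attention across modalities. The following is a minimal, hypothetical sketch of that stated idea, not the paper's implementation; the module names, modality set, dimensions, and the choice of standard PyTorch transformer layers are all illustrative assumptions.

```python
# Hypothetical sketch of one "multi-stream + joint self-attention" block,
# assuming the pattern described in the abstract. All names are illustrative.
import torch
import torch.nn as nn


class MSATBlock(nn.Module):
    """Per-modality self-attention streams followed by joint cross-modal
    self-attention over the concatenated token sequence."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Modality-specific streams: each modality gets its own encoder layer.
        self.streams = nn.ModuleDict({
            m: nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model,
                batch_first=True, norm_first=True)
            for m in ("vision", "language", "proprio", "action")
        })
        # Joint self-attention shared across all modalities.
        self.joint = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)

    def forward(self, tokens: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # 1) Process each modality in its own stream.
        streamed = {m: self.streams[m](x) for m, x in tokens.items()}
        # 2) Concatenate along the token axis and attend jointly, so every
        #    modality's tokens can attend to every other modality's tokens.
        lengths = [x.shape[1] for x in streamed.values()]
        fused = self.joint(torch.cat(list(streamed.values()), dim=1))
        # 3) Split back into per-modality token groups.
        chunks = torch.split(fused, lengths, dim=1)
        return dict(zip(streamed.keys(), chunks))


if __name__ == "__main__":
    block = MSATBlock()
    batch = {
        "vision": torch.randn(2, 64, 256),    # image patch tokens
        "language": torch.randn(2, 16, 256),  # instruction tokens
        "proprio": torch.randn(2, 8, 256),    # joint-state tokens
        "action": torch.randn(2, 32, 256),    # action chunk tokens
    }
    out = block(batch)
    print({m: tuple(v.shape) for m, v in out.items()})
```

Under this reading, the split into streams lets each modality keep its own representation statistics, while the joint attention step is where cross-modal capabilities such as language-conditioned action prediction would arise; how RLDX-1 actually arranges these blocks is specified only in the full report.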