RLDX-1 Technical Report
May 5, 2026
Authors: Dongyoung Kim, Huiwon Jang, Myungkyu Koo, Suhyeok Jang, Taeyoung Kim, Beomjun Kim, Byungjun Yoon, Changsung Jang, Daewon Choi, Dongsu Han, Donguk Lee, Heeseung Kwon, Hojin Jeon, Jaehyun Kang, Jaekyoung Bae, Jihyuk Lee, Jimin Lee, John Won, Joonwoo Ahn, Junhyeong Park, Junyoung Sung, Kyungmin Lee, Minseong Han, Minsung Yoon, Sejune Joo, Seonil Son, Seungcheol Park, Seunggeun Cho, Seungjun Moon, Seungku Kim, Yonghoon Dong, Yongjin Cho, Youngchan Kim, Chang Hwan Kim, Dohyeon Kim, Hazel Lee, Heecheol Kim, Hensen Ahn, Hyungkyu Ryu, Hyunsoo Choi, Hyunsoo Shin, Jaeheon Jung, Jaewoo Kim, Jinwook Kim, Joochul Chang, Joonsoo Kim, Junghun Park, Jungwoo Park, Junho Cho, Junhyeok Park, Junwon Lee, Kangwook Lee, Kwanghoon Kim, Kyoungwhan Choe, Manoj Bhadu, Nayoung Oh, Sangjun Kim, Sangwoo Kim, Seunghoon Shim, Seunghyun Kim, Seungjun Lee, Seungyup Ka, Sungryol Yang, Wook Jung, Yashu Shukla, Yeonjae Lee, Yeonwoo Bae, Jinwoo Shin
cs.AI
Abstract
While Vision-Language-Action models (VLAs) have made remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e., broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks that require broader functional capabilities (e.g., motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesized training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. In empirical evaluations, RLDX-1 consistently outperforms recent frontier VLAs (e.g., π_{0.5} and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 excels on ALLEX humanoid tasks, achieving a success rate of 86.8% while π_{0.5} and GR00T N1.6 reach around 40%, highlighting its ability to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
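The core architectural idea in the abstract, modality-specific streams feeding a cross-modal joint self-attention, can be illustrated with a minimal NumPy sketch. This is not the MSAT implementation: all dimensions, stream names (`vision`, `language`, `state`), and the single-head attention are hypothetical simplifications chosen only to show how per-modality projections bring heterogeneous inputs into a shared token space before every token attends jointly across all modalities.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared model width (illustrative choice)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Modality-specific streams: each modality has its own projection
# into the shared width (hypothetical input dimensions).
streams = {
    "vision": rng.standard_normal((d, 32)),    # 32-dim visual tokens
    "language": rng.standard_normal((d, 24)),  # 24-dim text tokens
    "state": rng.standard_normal((d, 8)),      # 8-dim proprioception
}

def encode(tokens_by_modality):
    # Project each modality with its own stream, then concatenate
    # along the token axis so attention can mix modalities.
    parts = [tokens_by_modality[m] @ streams[m].T for m in streams]
    return np.concatenate(parts, axis=0)

def joint_self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over the joint
    # token sequence: every token attends across all modalities.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v

tokens = {
    "vision": rng.standard_normal((6, 32)),   # 6 visual tokens
    "language": rng.standard_normal((4, 24)), # 4 text tokens
    "state": rng.standard_normal((1, 8)),     # 1 state token
}
x = encode(tokens)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = joint_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (11, 16): 6 + 4 + 1 tokens in the shared width
```

The design point this sketch makes is that heterogeneity is handled at the stream boundary (one projection per modality), while a single joint attention lets, e.g., the proprioceptive token condition directly on visual and language tokens, which is how the report frames unifying motion awareness and physical sensing with scene understanding.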