RLDX-1 Technical Report
May 5, 2026
Authors: Dongyoung Kim, Huiwon Jang, Myungkyu Koo, Suhyeok Jang, Taeyoung Kim, Beomjun Kim, Byungjun Yoon, Changsung Jang, Daewon Choi, Dongsu Han, Donguk Lee, Heeseung Kwon, Hojin Jeon, Jaehyun Kang, Jaekyoung Bae, Jihyuk Lee, Jimin Lee, John Won, Joonwoo Ahn, Junhyeong Park, Junyoung Sung, Kyungmin Lee, Minseong Han, Minsung Yoon, Sejune Joo, Seonil Son, Seungcheol Park, Seunggeun Cho, Seungjun Moon, Seungku Kim, Yonghoon Dong, Yongjin Cho, Youngchan Kim, Chang Hwan Kim, Dohyeon Kim, Hazel Lee, Heecheol Kim, Hensen Ahn, Hyungkyu Ryu, Hyunsoo Choi, Hyunsoo Shin, Jaeheon Jung, Jaewoo Kim, Jinwook Kim, Joochul Chang, Joonsoo Kim, Junghun Park, Jungwoo Park, Junho Cho, Junhyeok Park, Junwon Lee, Kangwook Lee, Kwanghoon Kim, Kyoungwhan Choe, Manoj Bhadu, Nayoung Oh, Sangjun Kim, Sangwoo Kim, Seunghoon Shim, Seunghyun Kim, Seungjun Lee, Seungyup Ka, Sungryol Yang, Wook Jung, Yashu Shukla, Yeonjae Lee, Yeonwoo Bae, Jinwoo Shin
cs.AI
Abstract
While Vision-Language-Action models (VLAs) have made remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e., broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks that require broader functional capabilities (e.g., motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesized training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. In empirical evaluations, RLDX-1 consistently outperforms recent frontier VLAs (e.g., π_{0.5} and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 excels on ALLEX humanoid tasks, achieving a success rate of 86.8% while π_{0.5} and GR00T N1.6 reach around 40%, highlighting its ability to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
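The core architectural idea in the abstract, modality-specific streams feeding a cross-modal joint self-attention, can be illustrated with a minimal NumPy sketch. This is not the MSAT implementation: all dimensions, stream names (`vision`, `language`, `state`), and the single-head attention are hypothetical simplifications chosen only to show how per-modality projections bring heterogeneous inputs into a shared token space before every token attends jointly across all modalities.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared model width (illustrative choice)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Modality-specific streams: each modality has its own projection
# into the shared width (hypothetical input dimensions).
streams = {
    "vision": rng.standard_normal((d, 32)),    # 32-dim visual tokens
    "language": rng.standard_normal((d, 24)),  # 24-dim text tokens
    "state": rng.standard_normal((d, 8)),      # 8-dim proprioception
}

def encode(tokens_by_modality):
    # Project each modality with its own stream, then concatenate
    # along the token axis so attention can mix modalities.
    parts = [tokens_by_modality[m] @ streams[m].T for m in streams]
    return np.concatenate(parts, axis=0)

def joint_self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over the joint
    # token sequence: every token attends across all modalities.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v

tokens = {
    "vision": rng.standard_normal((6, 32)),   # 6 visual tokens
    "language": rng.standard_normal((4, 24)), # 4 text tokens
    "state": rng.standard_normal((1, 8)),     # 1 state token
}
x = encode(tokens)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = joint_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (11, 16): 6 + 4 + 1 tokens in the shared width
```

The design point this sketch makes is that heterogeneity is handled at the stream boundary (one projection per modality), while a single joint attention lets, e.g., the proprioceptive token condition directly on visual and language tokens, which is how the report frames unifying motion awareness and physical sensing with scene understanding.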