
RLDX-1 Technical Report

May 5, 2026
Authors: Dongyoung Kim, Huiwon Jang, Myungkyu Koo, Suhyeok Jang, Taeyoung Kim, Beomjun Kim, Byungjun Yoon, Changsung Jang, Daewon Choi, Dongsu Han, Donguk Lee, Heeseung Kwon, Hojin Jeon, Jaehyun Kang, Jaekyoung Bae, Jihyuk Lee, Jimin Lee, John Won, Joonwoo Ahn, Junhyeong Park, Junyoung Sung, Kyungmin Lee, Minseong Han, Minsung Yoon, Sejune Joo, Seonil Son, Seungcheol Park, Seunggeun Cho, Seungjun Moon, Seungku Kim, Yonghoon Dong, Yongjin Cho, Youngchan Kim, Chang Hwan Kim, Dohyeon Kim, Hazel Lee, Heecheol Kim, Hensen Ahn, Hyungkyu Ryu, Hyunsoo Choi, Hyunsoo Shin, Jaeheon Jung, Jaewoo Kim, Jinwook Kim, Joochul Chang, Joonsoo Kim, Junghun Park, Jungwoo Park, Junho Cho, Junhyeok Park, Junwon Lee, Kangwook Lee, Kwanghoon Kim, Kyoungwhan Choe, Manoj Bhadu, Nayoung Oh, Sangjun Kim, Sangwoo Kim, Seunghoon Shim, Seunghyun Kim, Seungjun Lee, Seungyup Ka, Sungryol Yang, Wook Jung, Yashu Shukla, Yeonjae Lee, Yeonwoo Bae, Jinwoo Shin
cs.AI

Abstract

While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e., broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g., motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesized training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g., π_{0.5} and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows clear superiority on ALLEX humanoid tasks, achieving an 86.8% success rate where π_{0.5} and GR00T N1.6 reach around 40%, highlighting its ability to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
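
The abstract describes MSAT only at a high level: per-modality processing streams whose tokens are then fused by joint self-attention across modalities. The following is a minimal, hypothetical sketch of that stated idea, not the paper's implementation; the module names, modality set, dimensions, and the choice of standard PyTorch transformer layers are all illustrative assumptions.

```python
# Hypothetical sketch of one "multi-stream + joint self-attention" block,
# assuming the pattern described in the abstract. All names are illustrative.
import torch
import torch.nn as nn


class MSATBlock(nn.Module):
    """Per-modality self-attention streams followed by joint cross-modal
    self-attention over the concatenated token sequence."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Modality-specific streams: each modality gets its own encoder layer.
        self.streams = nn.ModuleDict({
            m: nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model,
                batch_first=True, norm_first=True)
            for m in ("vision", "language", "proprio", "action")
        })
        # Joint self-attention shared across all modalities.
        self.joint = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)

    def forward(self, tokens: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # 1) Process each modality in its own stream.
        streamed = {m: self.streams[m](x) for m, x in tokens.items()}
        # 2) Concatenate along the token axis and attend jointly, so every
        #    modality's tokens can attend to every other modality's tokens.
        lengths = [x.shape[1] for x in streamed.values()]
        fused = self.joint(torch.cat(list(streamed.values()), dim=1))
        # 3) Split back into per-modality token groups.
        chunks = torch.split(fused, lengths, dim=1)
        return dict(zip(streamed.keys(), chunks))


if __name__ == "__main__":
    block = MSATBlock()
    batch = {
        "vision": torch.randn(2, 64, 256),    # image patch tokens
        "language": torch.randn(2, 16, 256),  # instruction tokens
        "proprio": torch.randn(2, 8, 256),    # joint-state tokens
        "action": torch.randn(2, 32, 256),    # action chunk tokens
    }
    out = block(batch)
    print({m: tuple(v.shape) for m, v in out.items()})
```

Under this reading, the split into streams lets each modality keep its own representation statistics, while the joint attention step is where cross-modal capabilities such as language-conditioned action prediction would arise; how RLDX-1 actually arranges these blocks is specified only in the full report.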