Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Abstract

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, which makes it difficult to achieve alignment and scale at the same time. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos, without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality, so we instead evaluate on out-of-distribution (OOD) settings, including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.