Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

Abstract

Industrial robotic manipulation demands reliable long-horizon execution across embodiments, tasks, and changing object distributions. Vision-Language-Action models have demonstrated strong generalization, but they remain fundamentally reactive: they optimize the next action given the current observation without evaluating potential futures, which leaves them vulnerable to the compounding failure modes of long-horizon tasks. Cortex 2.0 shifts from reactive control to plan-and-act by generating candidate future trajectories in visual latent space, scoring them for expected success and efficiency, and then committing only to the highest-scoring candidate. We evaluate Cortex 2.0 on single-arm and dual-arm manipulation platforms across four tasks of increasing complexity: pick and place, item and trash sorting, screw sorting, and shoebox unpacking. Cortex 2.0 achieves the best results on all four tasks, consistently outperforming state-of-the-art Vision-Language-Action baselines. The system remains reliable in unstructured environments characterized by heavy clutter, frequent occlusions, and contact-rich manipulation, where reactive policies fail. These results demonstrate that world-model-based planning can operate reliably in complex industrial environments.
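The plan-and-act loop described above (generate candidate futures, score them, commit to the best one) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from Cortex 2.0: the one-dimensional latent state, the random action sampler, the `toy_dynamics` and `toy_success` functions, and the `effort_weight` penalty standing in for "efficiency" are all hypothetical placeholders for learned components.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List

Latent = List[float]  # hypothetical stand-in for a visual latent vector


@dataclass
class Candidate:
    actions: List[float]   # toy 1-D action sequence (assumption)
    final_latent: Latent   # predicted latent after the last action
    score: float           # expected success minus an effort penalty


def rollout(z0: Latent, actions: List[float],
            dynamics: Callable[[Latent, float], Latent]) -> Latent:
    """Roll the assumed latent dynamics forward; nothing is executed on a robot."""
    z = z0
    for a in actions:
        z = dynamics(z, a)
    return z


def generate_candidates(z0: Latent,
                        dynamics: Callable[[Latent, float], Latent],
                        success: Callable[[Latent], float],
                        num_candidates: int = 16,
                        horizon: int = 5,
                        effort_weight: float = 0.05,
                        seed: int = 0) -> List[Candidate]:
    """Sample random action sequences, imagine their outcome, and score each one."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        z_final = rollout(z0, actions, dynamics)
        effort = sum(abs(a) for a in actions)  # crude proxy for efficiency
        score = success(z_final) - effort_weight * effort
        candidates.append(Candidate(actions, z_final, score))
    return candidates


def select_best(candidates: List[Candidate]) -> Candidate:
    """Commit only to the highest-scoring candidate."""
    return max(candidates, key=lambda c: c.score)


# Toy problem: move a scalar latent from 0.0 toward a goal value of 2.0.
def toy_dynamics(z: Latent, a: float) -> Latent:
    return [z[0] + a]


def toy_success(z: Latent, goal: float = 2.0) -> float:
    return math.exp(-abs(z[0] - goal))


candidates = generate_candidates([0.0], toy_dynamics, toy_success)
best = select_best(candidates)
print(f"best score {best.score:.3f} out of {len(candidates)} candidates")
```

A reactive policy would correspond to picking a single action from the current latent alone. The sketch instead compares whole imagined trajectories before committing to one, which is the distinction the abstract draws.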
