Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

Abstract

Industrial robotic manipulation demands reliable long-horizon execution across embodiments, tasks, and changing object distributions. While Vision-Language-Action models have demonstrated strong generalization, they remain fundamentally reactive. Because they optimize only the next action given the current observation, without evaluating potential futures, they are brittle to the compounding failure modes of long-horizon tasks. Cortex 2.0 shifts from reactive control to plan-and-act: it generates candidate future trajectories in visual latent space, scores them for expected success and efficiency, and commits only to the highest-scoring candidate. We evaluate Cortex 2.0 on single-arm and dual-arm manipulation platforms across four tasks of increasing complexity: pick and place, item and trash sorting, screw sorting, and shoebox unpacking. Cortex 2.0 consistently outperforms state-of-the-art Vision-Language-Action baselines, achieving the best results on all four tasks. The system remains reliable in unstructured environments characterized by heavy clutter, frequent occlusions, and contact-rich manipulation, where reactive policies fail. These results demonstrate that world-model-based planning can operate reliably in complex industrial environments.
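
The generate, score, and commit loop described in the abstract can be illustrated with a minimal sketch. This is not the Cortex 2.0 implementation: the names `plan_and_act`, `make_toy_proposer`, and `toy_score` are hypothetical, the latent state is reduced to a single number, and a random walk stands in for a learned world model. The sketch only shows the selection step, in which several candidate futures are sampled, each is scored for a success proxy minus an effort penalty, and the best one is kept.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical simplification: a trajectory is a list of 1-D latent states.
Trajectory = List[float]


def plan_and_act(
    z0: float,
    propose: Callable[[float], Trajectory],
    score: Callable[[Trajectory], float],
    num_candidates: int = 8,
) -> Tuple[Trajectory, float]:
    """Sample candidate futures from z0, score each, and return the best."""
    if num_candidates < 1:
        raise ValueError("need at least one candidate")
    candidates = [propose(z0) for _ in range(num_candidates)]
    scores = [score(c) for c in candidates]
    best = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best], scores[best]


def make_toy_proposer(
    rng: random.Random, horizon: int = 5
) -> Callable[[float], Trajectory]:
    """Toy stand-in for a learned world model: a bounded random walk."""

    def propose(z0: float) -> Trajectory:
        traj = [z0]
        for _ in range(horizon):
            traj.append(traj[-1] + rng.uniform(-1.0, 1.0))
        return traj

    return propose


def toy_score(
    traj: Trajectory, goal: float = 0.0, effort_weight: float = 0.1
) -> float:
    """Assumed scoring rule: closeness of the final state to the goal
    (a proxy for success) minus a path-length penalty (a proxy for
    inefficiency)."""
    success = -abs(traj[-1] - goal)
    effort = sum(abs(b - a) for a, b in zip(traj, traj[1:]))
    return success - effort_weight * effort


if __name__ == "__main__":
    rng = random.Random(0)
    best, best_score = plan_and_act(
        3.0, make_toy_proposer(rng), toy_score, num_candidates=16
    )
    print(f"selected trajectory ends at {best[-1]:.3f}, score {best_score:.3f}")
```

In a real system the proposer and the scorer would be learned models operating on visual latents, and only the selected candidate would be passed on for execution; a reactive policy corresponds to skipping the sampling and scoring steps entirely.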