Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

Abstract

Industrial robotic manipulation demands reliable long-horizon execution across embodiments, tasks, and changing object distributions. While Vision-Language-Action models have demonstrated strong generalization, they remain fundamentally reactive. Because they optimize the next action given only the current observation, without evaluating potential futures, they are brittle to the compounding failure modes of long-horizon tasks. Cortex 2.0 shifts from reactive control to plan-and-act by generating candidate future trajectories in visual latent space, scoring them for expected success and efficiency, and then committing only to the highest-scoring candidate. We evaluate Cortex 2.0 on single-arm and dual-arm manipulation platforms across four tasks of increasing complexity: pick and place, item and trash sorting, screw sorting, and shoebox unpacking. Cortex 2.0 consistently outperforms state-of-the-art Vision-Language-Action baselines, achieving the best results across all tasks. The system remains reliable in unstructured environments characterized by heavy clutter, frequent occlusions, and contact-rich manipulation, where reactive policies fail. These results demonstrate that world-model-based planning can operate reliably in complex industrial environments.
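The plan-and-act loop described in the abstract (generate candidates, score them, commit to the best) can be illustrated with a toy sketch. Everything below is an assumption made for illustration only: the scalar "latent" state, the additive dynamics, the scoring formula, and the names `Candidate`, `propose_candidates`, `score`, and `plan` are hypothetical and are not taken from Cortex 2.0, whose world model and scorer are learned components not specified here.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical candidate future: an action sequence and its imagined latents."""
    actions: List[float]
    latents: List[float]


def propose_candidates(observation: float, num_candidates: int,
                       horizon: int, rng: random.Random) -> List[Candidate]:
    """Stand-in for a world model: roll out random action sequences.

    Assumption: the latent state is a single float and each action is
    simply added to it. A real system would use a learned model here.
    """
    candidates = []
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        state = observation
        latents = []
        for action in actions:
            state = state + action  # toy latent dynamics
            latents.append(state)
        candidates.append(Candidate(actions, latents))
    return candidates


def score(candidate: Candidate, goal: float,
          efficiency_weight: float = 0.1) -> float:
    """Toy score combining a success proxy with an efficiency penalty."""
    success = -abs(candidate.latents[-1] - goal)       # closer to goal is better
    effort = sum(abs(a) for a in candidate.actions)    # less motion is better
    return success - efficiency_weight * effort


def plan(observation: float, goal: float, num_candidates: int = 16,
         horizon: int = 5, seed: int = 0) -> Candidate:
    """Generate candidates, score each, and return only the highest-scoring one."""
    rng = random.Random(seed)
    candidates = propose_candidates(observation, num_candidates, horizon, rng)
    return max(candidates, key=lambda c: score(c, goal))


if __name__ == "__main__":
    best = plan(observation=0.0, goal=2.0)
    print("chosen actions:", [round(a, 2) for a in best.actions])
```

A reactive policy, by contrast, would map the current observation directly to one action with no comparison among imagined futures; the sketch only highlights that structural difference.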
