HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled richer perceptual grounding for code policy generation in embodied agents. However, most existing systems lack effective mechanisms to adaptively monitor policy execution and repair code during task completion. In this work, we introduce HyCodePolicy, a hybrid language-based control framework that systematically integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle for embodied agents. Given a natural language instruction, our system first decomposes it into subgoals and generates an initial executable program grounded in object-centric geometric primitives. The program is then executed in simulation, while a vision-language model (VLM) observes selected checkpoints to detect and localize execution failures and infer their likely reasons. By fusing structured execution traces, which capture program-level events, with this VLM-based perceptual feedback, HyCodePolicy diagnoses failure causes and repairs the program. This hybrid dual-feedback mechanism enables self-correcting program synthesis with minimal human supervision. Our results demonstrate that HyCodePolicy significantly improves the robustness and sample efficiency of robot manipulation policies, offering a scalable strategy for integrating multimodal reasoning into autonomous decision-making pipelines.
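
The closed loop described in the abstract (generate, execute, monitor, fuse feedback, repair) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name below (`TraceEvent`, `Diagnosis`, `fuse_feedback`, `run_closed_loop`, and the `toy_*` stubs) is hypothetical, the simulator and the VLM are replaced by toy functions, and the fusion rule (take the earliest step flagged by either feedback source) is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class TraceEvent:
    """One program-level event from a structured execution trace (hypothetical)."""
    step: int   # index of the program step
    name: str   # textual form of the step
    ok: bool    # whether the step reported success


@dataclass
class Diagnosis:
    """Fused failure report passed to the repair stage (hypothetical)."""
    step: int
    name: str
    reason: str


def fuse_feedback(trace: List[TraceEvent],
                  vlm_reports: Dict[int, str]) -> Optional[Diagnosis]:
    """Combine trace failures with checkpoint reports from the monitor.

    Assumed rule: the earliest step flagged by either source is the failure.
    """
    failed = {e.step for e in trace if not e.ok} | set(vlm_reports)
    if not failed:
        return None
    step = min(failed)
    name = next((e.name for e in trace if e.step == step), "<unknown>")
    reason = vlm_reports.get(step, "step reported failure in the execution trace")
    return Diagnosis(step=step, name=name, reason=reason)


def run_closed_loop(instruction: str,
                    generate: Callable[[str], List[str]],
                    execute: Callable[[List[str]], List[TraceEvent]],
                    monitor: Callable[[List[TraceEvent]], Dict[int, str]],
                    repair: Callable[[List[str], Diagnosis], List[str]],
                    max_rounds: int = 3) -> Tuple[List[str], bool, int]:
    """Return (program, success, number of repairs applied)."""
    program = generate(instruction)
    for round_idx in range(max_rounds):
        trace = execute(program)        # simulated rollout
        reports = monitor(trace)        # stand-in for VLM checkpoint checks
        diagnosis = fuse_feedback(trace, reports)
        if diagnosis is None:
            return program, True, round_idx
        program = repair(program, diagnosis)
    return program, False, max_rounds


# Toy stand-ins for the code generator, simulator, VLM monitor, and repairer.
def toy_generate(instruction: str) -> List[str]:
    return ["move_to(cup)", "bad_grasp(cup)", "place(cup, shelf)"]


def toy_execute(program: List[str]) -> List[TraceEvent]:
    return [TraceEvent(i, s, not s.startswith("bad_")) for i, s in enumerate(program)]


def toy_monitor(trace: List[TraceEvent]) -> Dict[int, str]:
    return {e.step: "gripper missed the object" for e in trace if not e.ok}


def toy_repair(program: List[str], diagnosis: Diagnosis) -> List[str]:
    fixed = list(program)
    fixed[diagnosis.step] = fixed[diagnosis.step].replace("bad_", "", 1)
    return fixed


if __name__ == "__main__":
    print(run_closed_loop("put the cup on the shelf",
                          toy_generate, toy_execute, toy_monitor, toy_repair))
```

Passing the four stages in as callables keeps the loop itself independent of any particular simulator or model, so the same control flow could wrap a real code generator and VLM monitor.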
