HyCodePolicy:面向具身代理的多模态监控与决策的混合语言控制器
HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents
August 4, 2025
作者: Yibin Liu, Zhixuan Liang, Zanxin Chen, Tianxing Chen, Mengkang Hu, Wanxi Dong, Congsheng Xu, Zhaoming Han, Yusen Qin, Yao Mu
cs.AI
摘要
近期多模態大型語言模型(MLLMs)的進展,為具身代理中的代碼策略生成提供了更豐富的感知基礎。然而,現有系統大多缺乏有效機制來適應性地監控策略執行並在任務完成過程中修復代碼。在本研究中,我們提出了HyCodePolicy,這是一種基於混合語言的控製框架,它系統性地將代碼合成、幾何基礎、感知監控和迭代修復整合到具身代理的閉環編程循環中。技術上,給定一個自然語言指令,我們的系統首先將其分解為子目標,並生成一個基於對象中心幾何原語的初始可執行程序。該程序隨後在模擬中執行,同時視覺語言模型(VLM)觀察選定的檢查點以檢測和定位執行失敗並推斷失敗原因。通過融合捕捉程序級事件的結構化執行軌跡與基於VLM的感知反饋,HyCodePolicy推斷失敗原因並修復程序。這種混合雙重反饋機制實現了在最小人力監督下的自我校正程序合成。我們的結果表明,HyCodePolicy顯著提高了機器人操作策略的魯棒性和樣本效率,為將多模態推理整合到自主決策管道中提供了一種可擴展的策略。
English
Recent advances in multimodal large language models (MLLMs) have enabled
richer perceptual grounding for code policy generation in embodied agents.
However, most existing systems lack effective mechanisms to adaptively monitor
policy execution and repair codes during task completion. In this work, we
introduce HyCodePolicy, a hybrid language-based control framework that
systematically integrates code synthesis, geometric grounding, perceptual
monitoring, and iterative repair into a closed-loop programming cycle for
embodied agents. Technically, given a natural language instruction, our system
first decomposes it into subgoals and generates an initial executable program
grounded in object-centric geometric primitives. The program is then executed
in simulation, while a vision-language model (VLM) observes selected
checkpoints to detect and localize execution failures and infer failure
reasons. By fusing structured execution traces capturing program-level events
with VLM-based perceptual feedback, HyCodePolicy infers failure causes and
repairs programs. This hybrid dual feedback mechanism enables self-correcting
program synthesis with minimal human supervision. Our results demonstrate that
HyCodePolicy significantly improves the robustness and sample efficiency of
robot manipulation policies, offering a scalable strategy for integrating
multimodal reasoning into autonomous decision-making pipelines.