HyCodePolicy: マルチモーダル監視と意思決定のためのハイブリッド言語コントローラを備えたエンボディードエージェント

要旨

マルチモーダル大規模言語モデル（MLLMs）の最近の進展により、エンボディドエージェントにおけるコードポリシー生成のためのより豊かな知覚的基盤が可能となった。しかし、既存のシステムの多くは、タスク完了中にポリシー実行を適応的に監視し、コードを修復する効果的なメカニズムを欠いている。本研究では、HyCodePolicyを紹介する。これは、コード合成、幾何学的基盤、知覚的監視、および反復的修復をエンボディドエージェントのための閉ループプログラミングサイクルに体系的に統合するハイブリッド言語ベースの制御フレームワークである。技術的には、自然言語の指示が与えられると、本システムはまずそれをサブゴールに分解し、オブジェクト中心の幾何学的プリミティブに基づいた初期の実行可能プログラムを生成する。次に、プログラムはシミュレーション内で実行され、視覚言語モデル（VLM）が選択されたチェックポイントを観察して実行失敗を検出し、その位置を特定し、失敗の原因を推論する。プログラムレベルのイベントを捕捉する構造化された実行トレースとVLMベースの知覚的フィードバックを融合させることで、HyCodePolicyは失敗の原因を推論し、プログラムを修復する。このハイブリッドな二重フィードバックメカニズムにより、最小限の人間の監督で自己修正型のプログラム合成が可能となる。我々の結果は、HyCodePolicyがロボット操作ポリシーの堅牢性とサンプル効率を大幅に向上させ、マルチモーダル推論を自律的意思決定パイプラインに統合するためのスケーラブルな戦略を提供することを示している。

English

Recent advances in multimodal large language models (MLLMs) have enabled richer perceptual grounding for code policy generation in embodied agents. However, most existing systems lack effective mechanisms to adaptively monitor policy execution and repair codes during task completion. In this work, we introduce HyCodePolicy, a hybrid language-based control framework that systematically integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle for embodied agents. Technically, given a natural language instruction, our system first decomposes it into subgoals and generates an initial executable program grounded in object-centric geometric primitives. The program is then executed in simulation, while a vision-language model (VLM) observes selected checkpoints to detect and localize execution failures and infer failure reasons. By fusing structured execution traces capturing program-level events with VLM-based perceptual feedback, HyCodePolicy infers failure causes and repairs programs. This hybrid dual feedback mechanism enables self-correcting program synthesis with minimal human supervision. Our results demonstrate that HyCodePolicy significantly improves the robustness and sample efficiency of robot manipulation policies, offering a scalable strategy for integrating multimodal reasoning into autonomous decision-making pipelines.

HyCodePolicy: マルチモーダル監視と意思決定のためのハイブリッド言語コントローラを備えたエンボディードエージェント

HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents

要旨

Support