HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled richer perceptual grounding for code policy generation in embodied agents. However, most existing systems lack effective mechanisms to adaptively monitor policy execution and repair code during task completion. In this work, we introduce HyCodePolicy, a hybrid language-based control framework that systematically integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle for embodied agents. Technically, given a natural language instruction, our system first decomposes it into subgoals and generates an initial executable program grounded in object-centric geometric primitives. The program is then executed in simulation, while a vision-language model (VLM) observes selected checkpoints to detect and localize execution failures and infer failure reasons. By fusing structured execution traces capturing program-level events with VLM-based perceptual feedback, HyCodePolicy infers failure causes and repairs programs. This hybrid dual feedback mechanism enables self-correcting program synthesis with minimal human supervision. Our results demonstrate that HyCodePolicy significantly improves the robustness and sample efficiency of robot manipulation policies, offering a scalable strategy for integrating multimodal reasoning into autonomous decision-making pipelines.
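The closed-loop cycle described above (generate, execute, monitor, fuse feedback, repair) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`TraceEvent`, `Observation`, `Diagnosis`, `fuse_feedback`, `run_closed_loop`) and the `max_repairs` budget are hypothetical. The code generator, simulator, VLM monitor, and repair step are passed in as plain callables, and the fusion rule (take the earliest failing step, prefer the perceptual reason) is an assumption chosen for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TraceEvent:
    """One program-level event from a structured execution trace (hypothetical)."""
    step: int
    name: str
    ok: bool


@dataclass
class Observation:
    """Perceptual verdict from a VLM at one checkpoint (hypothetical)."""
    step: int
    ok: bool
    reason: str = ""


@dataclass
class Diagnosis:
    """Fused failure report handed to the repair step (hypothetical)."""
    step: int
    reason: str


def fuse_feedback(trace: List[TraceEvent],
                  observations: List[Observation]) -> Optional[Diagnosis]:
    """Merge symbolic and perceptual feedback into a single diagnosis.

    Returns None if neither source reports a failure. Otherwise localizes the
    failure to the earliest failing step and prefers the VLM's stated reason.
    """
    failed = [e.step for e in trace if not e.ok]
    failed += [o.step for o in observations if not o.ok]
    if not failed:
        return None
    step = min(failed)
    for o in observations:
        if o.step == step and not o.ok and o.reason:
            return Diagnosis(step, o.reason)
    for e in trace:
        if e.step == step and not e.ok:
            return Diagnosis(step, f"trace event '{e.name}' failed")
    return Diagnosis(step, "unknown failure")


def run_closed_loop(instruction: str,
                    generate: Callable[[str], list],
                    execute: Callable[[list], List[TraceEvent]],
                    monitor: Callable[[list, List[TraceEvent]], List[Observation]],
                    repair: Callable[[list, Diagnosis], list],
                    max_repairs: int = 3) -> Tuple[list, bool, int]:
    """Generate a program, then execute, monitor, and repair it until it succeeds.

    Returns (final program, success flag, number of repairs applied).
    """
    program = generate(instruction)
    for repairs_used in range(max_repairs + 1):
        trace = execute(program)
        diagnosis = fuse_feedback(trace, monitor(program, trace))
        if diagnosis is None:
            return program, True, repairs_used
        if repairs_used < max_repairs:
            program = repair(program, diagnosis)
    return program, False, max_repairs
```

In this sketch the loop stops as soon as both feedback channels report no failure, or once the repair budget is exhausted. Keeping the four stages as injected callables lets toy stubs stand in for a language model, a simulator, and a VLM when exercising the control flow.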

