Anticipatory Planning for Multimodal AI Agents

Abstract

Recent advances in multimodal agents have improved computer-use interaction and tool use, yet most existing systems remain reactive: they optimize each action in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. We evaluate TraceR1 on seven benchmarks spanning online computer use, offline computer use, and multimodal tool-use reasoning, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.
