A Zero-Shot Language Agent for Computer Control with Structured Reflection

Abstract

Large language models (LLMs) have shown increasing capacity for planning and executing a high-level goal in a live computer environment (e.g., MiniWoB++). To perform a task, recent works often require a model to learn from trace examples of the task via either supervised learning or few/many-shot prompting. Without these trace examples, it remains a challenge for an agent to autonomously learn and improve its control of a computer, which limits its ability to perform a new task. We approach this problem with a zero-shot agent that requires no expert traces. Our agent plans executable actions in a partially observed environment, and iteratively progresses a task by identifying and learning from its mistakes via self-reflection and structured thought management. On the easy tasks of MiniWoB++, we show that our zero-shot agent often outperforms recent state-of-the-art (SoTA) methods, with more efficient reasoning. On more complex tasks, our reflective agent performs on par with the prior best models, even though previous works had the advantage of access to expert traces or additional screen information.
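
To make the plan, act, reflect cycle described above more concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the names `ReflectionMemory`, `run_trial`, and `solve` are hypothetical, the language model's choice is replaced by "pick the first candidate not yet rejected", and the reflection step is replaced by a toy environment that reports the first wrong step. It only illustrates how recording mistakes per step can let an agent improve across trials without any expert traces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class ReflectionMemory:
    """Hypothetical store: for each step index, actions earlier trials found wrong."""
    rejected: Dict[int, Set[str]] = field(default_factory=dict)

    def reject(self, step: int, action: str) -> None:
        self.rejected.setdefault(step, set()).add(action)

    def allowed(self, step: int, candidates: List[str]) -> List[str]:
        banned = self.rejected.get(step, set())
        return [a for a in candidates if a not in banned]


def run_trial(
    target: List[str], candidates: List[str], memory: ReflectionMemory
) -> Tuple[List[str], Optional[int]]:
    """Plan one action per step, then return the plan and the first wrong step.

    Picking options[0] is a stand-in for an LLM planner, and comparing against
    `target` is a stand-in for environment feedback plus self-reflection.
    """
    plan: List[str] = []
    for step in range(len(target)):
        options = memory.allowed(step, candidates)
        if not options:
            return plan, step  # nothing left to try at this step
        plan.append(options[0])
    for step, (chosen, wanted) in enumerate(zip(plan, target)):
        if chosen != wanted:
            return plan, step
    return plan, None


def solve(
    target: List[str], candidates: List[str], max_trials: int = 10
) -> Tuple[Optional[List[str]], int]:
    """Retry the task, rejecting the first mistaken action after each failed trial."""
    memory = ReflectionMemory()
    for trial in range(1, max_trials + 1):
        plan, bad_step = run_trial(target, candidates, memory)
        if bad_step is None:
            return plan, trial
        if bad_step < len(plan):
            memory.reject(bad_step, plan[bad_step])
    return None, max_trials


if __name__ == "__main__":
    goal = ["click_tab", "type_text", "click_submit"]
    actions = ["click_submit", "click_tab", "type_text"]
    print(solve(goal, actions))
```

In this toy setting the agent starts with no examples and narrows its choices only through its own recorded mistakes. A real agent would replace both stand-ins with model calls and would observe the screen only partially.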
